Kamran Afzali's Portfolio: Quantitative psychologist and data scientist keen to develop intelligent assessment and decision-making tools in the context of human behavior.
https://kamran-afzali.github.io/
Wed, 14 Aug 2024 18:18:56 +0000 (Jekyll v3.10.0)

<h3 id="synthetic-data-in-health-care">Synthetic Data in Health Care</h3>
<p><strong>Why do we need anonymization in healthcare?</strong></p>
<p>Due to the sensitive nature of health data and the regulatory requirements surrounding its use, data anonymization is an important issue in healthcare informatics. It is the process of removing or obscuring personally identifiable information (PII) from datasets so that individuals cannot be identified, allowing the data to be used for analysis, research, and other use cases without compromising privacy. Not only is this procedure essential for maintaining patient trust and adhering to ethical standards, but many data protection regulations, such as the European Union’s General Data Protection Regulation (GDPR), the Health Insurance Portability and Accountability Act (HIPAA) in the United States, and the Personal Information Protection and Electronic Documents Act (PIPEDA) in Canada, mandate some form of anonymization of personal data to protect privacy. Compliance with these regulations is crucial to avoid legal penalties and maintain public trust.</p>
<p>Anonymized data also enhances data security by reducing the risk of data breaches and identity theft. Even if unauthorized parties access anonymized data, the absence of PII makes it less valuable and less likely to be misused. Furthermore, anonymization facilitates data sharing across different departments, organizations, and even countries without compromising individual privacy. This is particularly important in healthcare, where data sharing can lead to significant advancements in medical research and public health. Health organizations can leverage large datasets to gain insights, improve services, and innovate while respecting privacy. Proper anonymization techniques ensure that the data remains useful for analysis and decision-making.</p>
<p>Various techniques are employed to anonymize health data, each with its strengths and applications. <strong>Data Masking</strong> involves creating a modified version of sensitive data, accessible either in real-time (dynamic data masking) or through a duplicate database with anonymized data (static data masking). Techniques for data masking include encryption, shuffling terms or characters, and dictionary substitution, ensuring that sensitive information is hidden while maintaining data usability. <strong>Pseudonymization</strong> replaces private identifiers with pseudonyms or false identifiers to ensure data confidentiality while preserving the statistical accuracy of the dataset. For example, the name “David Adams” might be replaced with “John Smith,” keeping the data usable for analysis but protecting the individual’s identity. <strong>Generalization</strong> involves omitting or broadening certain data to make it less identifiable. For example, a specific age might be replaced with a range. This technique aims to remove specific identifiers without compromising the overall accuracy of the data. <strong>Data Swapping</strong> rearranges dataset attribute values so they no longer match the original records. This can include switching columns with recognizable values such as dates of birth, effectively anonymizing the data by ensuring that direct identifiers are not associated with the original records. <strong>Data Perturbation</strong> slightly alters the original dataset by applying methods such as rounding or adding random noise. The extent of the perturbation must be carefully balanced: too little noise may leave the data insufficiently anonymized, while too much can make it unusable. <strong>Synthetic Data</strong> generates entirely new datasets that do not relate to any real individuals or cases. 
This method uses algorithms and mathematical models to produce data based on patterns and features from the original dataset. Techniques such as linear regression, standard deviations, and medians help create synthetic data that retains the statistical properties of the original data without compromising privacy.</p>
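<p>As a minimal sketch of two of these techniques, the following Python snippet applies pseudonymization and generalization to a single record. It is illustrative only: the salt, field names, and bucket width are hypothetical choices, and keyed hashing here stands in for a real pseudonymization service.</p>

```python
import hashlib

def pseudonymize(name, salt="demo-salt"):
    # Replace a direct identifier with a stable pseudonym. The salt is a
    # hypothetical stand-in for a key kept in a secure store; without it,
    # common names could be re-identified by hashing guesses.
    return hashlib.sha256((salt + name).encode("utf-8")).hexdigest()[:8]

def generalize_age(age, width=10):
    # Replace an exact age with a range, e.g. 37 becomes "30-39".
    low = (age // width) * width
    return f"{low}-{low + width - 1}"

record = {"name": "David Adams", "age": 37, "diagnosis": "asthma"}
anonymized = {
    "pid": pseudonymize(record["name"]),
    "age_range": generalize_age(record["age"]),
    "diagnosis": record["diagnosis"],
}
```

<p>Because the pseudonym is deterministic, records belonging to the same patient can still be linked across tables without ever exposing the name itself.</p>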
<p>Although data anonymization has significant applications in healthcare, challenges remain, such as the risk of re-identification, balancing privacy and utility, evolving regulations, and technological advancements. In terms of balancing privacy and utility, over-anonymization can make data useless, while under-anonymization can compromise privacy. Organizations must carefully calibrate their techniques to maintain this balance. Additionally, data protection regulations are continually evolving, requiring organizations to stay updated with the latest requirements to ensure compliance, which can be resource-intensive. Lastly, technological advancements continually introduce new methods for re-identifying anonymized data. Organizations must stay ahead of these developments and adapt their strategies accordingly to ensure ongoing data privacy and security. Despite the challenges, the benefits of data anonymization in terms of privacy protection, regulatory compliance, data security, and enabling research and analysis make it an essential practice in today’s data-driven world. In this context, a more robust approach to safeguarding privacy in data analysis is emerging: <strong>differential privacy</strong>. This advanced concept goes beyond traditional anonymization techniques by providing strong mathematical guarantees against re-identification.</p>
<p><strong>Differential privacy</strong></p>
<p><strong>Differential privacy</strong> is a mathematical framework designed to ensure that the privacy of individuals within datasets is maintained while still allowing for meaningful data analysis. It achieves this by introducing controlled noise or randomness into the data that masks the presence or absence of any individual’s data, thereby preserving overall statistical properties and insights that can be derived from the dataset as a whole. The fundamental principle of differential privacy is to ensure that the probability of any output, such as statistical results or machine learning models, is essentially the same whether an individual’s data is included or not. This is achieved by carefully calibrating the amount of random noise added, based on the sensitivity of the computation being performed. A key aspect of differential privacy is the quantifiable privacy loss parameter, epsilon (ε), which allows a trade-off between privacy and data accuracy or utility. Smaller values of epsilon indicate stronger privacy guarantees but can result in less accurate data outputs.</p>
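<p>The calibration described above is most often realized with the Laplace mechanism, the textbook way to achieve ε-differential privacy for numeric queries: noise is drawn from a Laplace distribution whose scale is the query's sensitivity divided by ε. The Python sketch below (the patient count of 412 is made up) uses inverse-CDF sampling so only the standard library is needed:</p>

```python
import math
import random

def laplace_noise(scale):
    # Inverse-CDF sample from a Laplace(0, scale) distribution
    # using a single uniform draw on (-0.5, 0.5).
    u = random.random() - 0.5
    return -scale * math.copysign(math.log(1.0 - 2.0 * abs(u)), u)

def private_count(true_count, epsilon, sensitivity=1.0):
    # A counting query changes by at most 1 when any one record is added
    # or removed, so its sensitivity is 1; the noise scale is
    # sensitivity / epsilon. Smaller epsilon means a larger scale:
    # stronger privacy, noisier answer.
    return true_count + laplace_noise(sensitivity / epsilon)

# Release a (hypothetical) count of patients with a given diagnosis
noisy_count = private_count(412, epsilon=1.0)
```

<p>With ε = 1 the released count is typically within a few units of the truth, yet the presence or absence of any single patient is masked by the noise.</p>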
<p>Differential privacy is robust to auxiliary information, meaning that even if an attacker has additional information, the privacy of individuals within the dataset is still protected against re-identification attacks. Additionally, differential privacy maintains its guarantee even after post-processing the output, ensuring that any further manipulation of the data does not compromise privacy. There are two main types of differential privacy: global differential privacy (GDP) and local differential privacy (LDP). Global differential privacy applies noise to the output of an algorithm that operates on the entire dataset, providing a strong privacy guarantee for all individuals in the dataset. Local differential privacy, on the other hand, applies noise to each individual data point before it is collected or sent to an algorithm.</p>
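<p>Local differential privacy is often illustrated with randomized response: each individual perturbs their own answer before it is collected, yet the population rate can still be recovered from the noisy aggregate. A minimal sketch (the truth probability of 0.75 is an arbitrary illustrative choice):</p>

```python
import random

def randomized_response(true_answer, p_truth=0.75):
    # Answer honestly with probability p_truth; otherwise answer with a
    # fair coin flip. Any single reported answer is therefore deniable.
    if random.random() < p_truth:
        return true_answer
    return random.random() < 0.5

def estimate_rate(responses, p_truth=0.75):
    # Invert the known noise model:
    # E[observed] = p_truth * rate + (1 - p_truth) * 0.5
    observed = sum(responses) / len(responses)
    return (observed - (1.0 - p_truth) * 0.5) / p_truth
```

<p>With enough respondents the estimate concentrates around the true rate even though no individual response can be trusted, which is exactly the local-privacy guarantee described above.</p>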
<p>In healthcare, differential privacy has applications where healthcare organizations can share aggregate statistics about patients, such as the prevalence of certain diseases or average treatment costs, without revealing personal information about individual patients. The noise added through differential privacy masks the contribution of any single patient record, ensuring privacy while allowing for valuable insights. Likewise, differential privacy also enables the analysis of healthcare utilization patterns and costs across populations without compromising individual patient confidentiality. In the same vein, researchers can perform analyses on sensitive medical datasets, such as electronic health records or genomic data, using differential privacy to protect patient privacy while still enabling valuable insights. The added noise prevents re-identification attacks, ensuring that sensitive patient information remains confidential. Differential privacy also supports generating high-quality synthetic datasets that mimic real health data for research, testing, or training purposes. These synthetic datasets allow researchers to conduct studies and develop algorithms without exposing actual patient information, providing a valuable resource for advancing healthcare technologies while maintaining privacy.</p>
<p>Despite its advantages, differential privacy faces several challenges, particularly in the healthcare domain. Communicating the level of privacy guaranteed by differential privacy in a clear and understandable way is a challenge for gaining public trust and compliance. One significant issue is the theoretical nature of the privacy parameter epsilon, which can be difficult to explain to patients and stakeholders. Another challenge is the trade-off between utility and privacy. Adding noise to the data can diminish the accuracy of data analysis, especially in small datasets. Balancing the need for accurate insights with the need to protect individual privacy is a critical aspect of implementing differential privacy effectively. Moreover, real-world implementations of differential privacy in health research are still relatively rare. More development, case studies, and evaluations are needed to demonstrate its practical applications and effectiveness.</p>
<p>As the field continues to evolve, addressing the challenges and limitations of differential privacy will be crucial for its broader adoption and effectiveness in protecting privacy across various domains, including healthcare. By continuously improving and adapting differential privacy methods, we can ensure that the benefits of data analysis and innovation are realized without compromising individual privacy. Differential privacy offers a principled and practical way to balance the trade-off between utility and privacy. It provides a clear and meaningful definition of privacy, a rigorous and quantifiable measure of privacy loss, and a flexible and adaptable framework for designing privacy-preserving algorithms. Its applications in healthcare are particularly promising, enabling insights from data while providing a rigorous, mathematically provable privacy guarantee to protect sensitive patient information.</p>
<h3 id="references">References</h3>
<ul>
<li><a href="http://www.tdp.cat/issues11/tdp.a129a13.pdf">Practicing Differential Privacy in Health Care: A Review </a></li>
<li><a href="https://becominghuman.ai/what-is-differential-privacy-1fd7bf507049?gi=47f4ffff5342">What is Differential Privacy?</a></li>
<li><a href="https://blog.cloudflare.com/have-your-data-and-hide-it-too-an-introduction-to-differential-privacy">Have your data and hide it too: an introduction to differential privacy</a></li>
<li><a href="https://blog.openmined.org/basics-local-differential-privacy-vs-global-differential-privacy/">Global or Local Differential Privacy?</a></li>
<li><a href="https://blog.pangeanic.com/6-personal-data-anonymization-techniques">6 Personal Data Anonymization Techniques You Should Know About</a></li>
<li><a href="https://blog.pangeanic.com/synthetic-data-vs-anonymized-data">Synthetic Data vs Anonymized Data</a></li>
<li><a href="https://corporatefinanceinstitute.com/resources/business-intelligence/data-anonymization/">Data Anonymization</a></li>
<li><a href="https://desfontain.es/blog/friendly-intro-to-differential-privacy.html">A friendly, non-technical introduction to differential privacy</a></li>
<li><a href="https://desfontain.es/blog/local-global-differential-privacy.html">Local vs. central differential privacy </a></li>
<li><a href="https://enlitic.com/blogs/deidentifying-and-anonymizing-healthcare-data/">Deidentifying and Anonymizing Healthcare Data</a></li>
<li><a href="https://geninvo.com/importance-and-examples-of-usage-of-data-anonymization-in-healthcare-other-sectors/">Importance and examples of usage of Data Anonymization in Healthcare & Other sectors</a></li>
<li><a href="https://geninvo.com/the-what-and-why-of-clinical-data-anonymization/">The “What” and “Why” of Clinical Data Anonymization</a></li>
<li><a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10020182/">An anonymization-based privacy-preserving data collection protocol for digital health data</a></li>
<li><a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10257172/">A Survey on Differential Privacy for Medical Data Analysis</a></li>
<li><a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6658290/">Use and Understanding of Anonymization and De-Identification in the Biomedical Literature: Scoping Review </a></li>
<li><a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8449619/">Differential privacy in health research: A scoping review</a></li>
</ul>
Tue, 30 Jul 2024 00:00:00 +0000
/posts/2024-07-30/anon.html
Tags: Differential Privacy, Synthetic Data, DigitalHealth, posts

<h3 id="free-energy-principle-and-emotional-valence">Free Energy Principle and Emotional Valence</h3>
<p>This is the third post in our series on the Free Energy Principle (FEP) as a unified account of perception, learning, and action, with potential applications in emotion dynamics. In this post we discuss the proposal by <a href="https://journals.plos.org/ploscompbiol/article/file?id=10.1371/journal.pcbi.1003094&type=printable">Joffily (2013)</a> to understand emotional valence in terms of the negative rate of change of free energy over time, suggesting a dynamic interaction between valence and learning rate. As discussed earlier, the free-energy principle posits that adaptive agents must minimize their free energy to resist disorder, and that biological agents encode probabilistic models of the causes of their sensations. This principle is grounded in the understanding that minimizing free energy can be interpreted as minimizing the prediction error between actual and predicted sensory inputs, especially under Gaussian assumptions. Adaptive agents achieve this minimization through two main tactics: adjusting their internal states for more accurate predictions and acting on the environment to sample sensations that align with their predictions. Importantly, perceptual inference, perceptual learning, and active inference all rely on the same Bayesian scheme, wherein agents infer the causes of sensory inputs, learn the relationship between inputs and causes, and act on the world to fulfill prior expectations about sensory inputs, respectively.</p>
<p>In the process of inferring and learning the causes of their sensations in a dynamic environment, adaptive agents encounter various forms of uncertainty: estimation uncertainty, volatility, and unexpected uncertainty. Estimation uncertainty is the known estimation variance of states of the world causing sensory inputs and can be mitigated through learning. Volatility refers to slow and continuous changes in states of the world, often modeled by linking estimation uncertainty to a latent stochastic process. Unexpected uncertainty arises from surprising sensory inputs caused by discrete and rapid changes in states of the world, necessitating the resetting of learning from new sensory data. Dealing with these forms of uncertainty is crucial for Bayesian models of learning in non-stationary environments, presenting a significant challenge for dynamically updating beliefs about the world to optimize predictions. Despite the acknowledged influence of emotions on beliefs and their resistance to change, computational models have largely overlooked or failed to integrate emotional aspects. Emotional valence, representing the positive or negative character of emotion, covers aspects such as subjective experiences and expressive behaviors, with valence considered a core dimension of subjective experiences of moods and emotions. This understanding of the FEP and its relation to uncertainty, combined with the recognition of emotional valence as a critical aspect of subjective experience, sets the stage for integrating emotions into computational models of adaptive agents. Such integration holds promise for a more comprehensive understanding of human behavior and cognition in dynamic environments.</p>
<p>To bridge the gap between the biological principle of minimizing free-energy in agents at equilibrium with their environment and the traditional understanding of human behavior driven by pleasure-seeking and pain-avoidance, here we discuss a framework that links positive and negative emotional experiences to changes in free-energy over time. In this continuous time domain, the rate of change of free-energy $F(t)$ is represented by its first time-derivative $F’(t)$ at time $t$. We formally define the valence of a state visited by an agent at time $t$ as the negative first time-derivative of free-energy at that state, or simply $-F’(t)$. Adaptive agents encode a hierarchical generative model of the causes of their sensations, where states of increasing complexity are encoded in higher levels of the hierarchy and sensory data are encoded at the lowest level. Free-energy is minimized independently for each level of the hierarchy, and $F_i(t)$ represents the free-energy associated with the hidden state at the $i$-th level. According to our definition of emotional valence, when $F’_i(t)$ is positive (indicating an increase in free-energy over time at level $i$ of the hierarchy), the valence of the state at this level is negative at time $t$. Conversely, when $F’_i(t)$ is negative (indicating a decrease in free-energy over time at level $i$), the valence of the state is positive at time $t$. Neutral valenced states, where $F’_i(t)$ is zero, may also exhibit low or high levels of surprise, as free-energy serves as an upper bound on surprise.</p>
<p>Cognitive theories of emotion have often relied on beliefs about states of affairs for their analyses. Emotions like happiness, unhappiness, relief, and disappointment are associated with certain (firm) beliefs about states of affairs, while emotions like hope and fear are related to uncertain beliefs. These two classes of emotions have been termed factive and epistemic, respectively. To illustrate the difference between factive and epistemic emotions, consider the example of Kamran waiting for a train. Kamran is happy if he desires the train to be on time and firmly believes it is, unhappy if he doesn’t desire it to be on time and firmly believes it is, hopeful if he desires it to be on time but is uncertain, and fearful if he doesn’t desire it to be on time but is uncertain. Relief and disappointment, on the other hand, are associated with transitions from uncertain to certain beliefs. In this framework, beliefs and desires can be related to bottom-up conditional expectations and top-down predictions, respectively, in a predictive coding scheme of free-energy minimization. Avoiding the assumption of certain beliefs inherent in cognitive theories, we focus solely on the dynamics of free-energy, showing that factive and epistemic emotions are associated with low and high levels of uncertainty, respectively. In the continuous time domain, the rate of change of the first derivative of free-energy $F’_i(t)$ at the $i$-th level is the second time-derivative of free-energy. Analogously to mechanical physics, $F’_i(t)$ and $F’’_i(t)$ represent the velocity and acceleration of free-energy at time $t$, respectively. We propose that when both $F’_i(t)$ and $F’’_i(t)$ are negative, indicating a decrease in free-energy ‘faster and faster’ over time, the agent hopes to visit a state of lower free-energy in the near future at level $i$. Conversely, when $F’_i(t)$ is negative and $F’’_i(t)$ is positive, the agent is happy to be visiting a state of lower free-energy than before at this level. Similarly, when both $F’_i(t)$ and $F’’_i(t)$ are positive, indicating an increase in free-energy ‘faster and faster’ over time, the agent fears visiting a state of greater free-energy in the near future. However, when $F’_i(t)$ is positive and $F’’_i(t)$ is negative, the agent is unhappy to be visiting a state of higher free-energy than before.</p>
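<p>The four factive and epistemic emotions above can be read off from the signs of the two derivatives. A toy Python sketch of this mapping (the handling of exactly-zero derivatives is our own simplifying choice, not part of the original proposal):</p>

```python
def basic_emotion(f_prime, f_double_prime):
    # Valence is the negative rate of change of free energy: F'(t) < 0
    # means positive valence, F'(t) > 0 negative valence. The second
    # derivative separates epistemic (hope, fear) from factive
    # (happiness, unhappiness) emotions.
    if f_prime == 0:
        return "neutral"
    if f_prime < 0:
        # Free energy is falling; accelerating descent reads as hope.
        return "hope" if f_double_prime < 0 else "happiness"
    # Free energy is rising; accelerating ascent reads as fear.
    return "fear" if f_double_prime > 0 else "unhappiness"
```

<p>Each call lands in one region of the two-dimensional space spanned by the first and second time-derivatives of free energy.</p>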
<p>Transitions between emotional states follow a pattern, with transitions from negative to positive emotions passing through relief and transitions from positive to negative emotions passing through disappointment. In other words, each basic emotion is mapped onto a particular region of a two-dimensional space defined by the first and second time-derivatives of free-energy, illustrating the complex relationship between affective states and free-energy dynamics. In this post we discussed the integration of emotional valence and basic forms of emotion into the FEP framework, originally developed to study perception, learning, and action. <strong>Valence, computed as the negative rate of change of free-energy</strong>, serves as a vital indicator for biological agents, informing them about unexpected changes in their environment. A positive valence suggests that sensory inputs align with the agent’s expectations, indicating a low probability of unexpected changes. Conversely, a negative valence signifies that the agent’s expectations are violated, signaling likely unexpected changes in the environment. In dynamic environments where recent information is a better predictor of the world’s states than past information, recent information should be weighted more heavily, which calls for a high learning rate to adapt quickly to new information. In contrast, for stationary environments where past and recent information are equally informative, a low learning rate is more suitable, reflecting the significance of both past and recent data. In the context of the FEP it is possible to formalize emotional meta-learning, where estimation uncertainty is determined not only by the <em>surprise</em> per se but also by the rate of change of <em>surprise</em>. Specifically, when the free-energy associated with posterior beliefs about states at a particular level in the agent’s hierarchical model increases, the posterior certainty about these states decreases. This implies that decreasing evidence for the agent’s estimates of states of the world indicates excessive confidence in those states. Emotional regulation of uncertainty is framed as meta-learning to emphasize that learning is influenced by the consequences of this adjustment, particularly the rate of change of variational free-energy. Importantly, this emotional meta-learning is not tied to any specific generative model, as expectations about states are optimized with respect to variational free-energy either at an evolutionary timescale or during experience-dependent learning. The emotional update presented here aligns with several key heuristics in the optimization literature, particularly regularization schemes. These schemes, such as <em>Levenberg-Marquardt regularization</em>, decrease the learning rate or gradient descent step when the objective function being optimized does not change as expected. Typically, this regularization adjusts the relative precision of the data, making more cautious updates in response to adverse changes in the objective function, such as the free energy in our scheme. Importantly, in a hierarchical setting, this adaptation of the rate of optimization or learning at various levels of the hierarchical model can lead to adaptive changes in the agent’s behavior and perception.</p>
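<p>In the spirit of the Levenberg-Marquardt-style regularization just mentioned, the emotional adjustment of the learning rate can be caricatured in a few lines of Python. The multiplicative constants below are hypothetical; only the direction of the adjustment is taken from the scheme described above.</p>

```python
def adjust_learning_rate(lr, delta_free_energy, down=0.5, up=1.1, lr_max=1.0):
    # If free energy rose over the last step (negative valence, growing
    # surprise), make more cautious updates; if it fell (positive
    # valence), cautiously speed learning back up, capped at lr_max.
    if delta_free_energy > 0:
        return lr * down
    return min(lr * up, lr_max)
```

<p>Applied at each level of a hierarchical model, such a rule lets the rate of change of free energy, rather than free energy alone, regulate how strongly new evidence revises beliefs.</p>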
<p>To this point we have tried to link emotional valence to changes in free energy over time, suggesting that decreasing free energy induces positive emotions, while increasing free energy induces negative ones. In the same vein, dual-process theories propose that cognition comprises implicit and explicit processes. Neuroscientific studies have shown temporally separated phases of emotional processing, with rapid, automatic processing preceding slower, explicit processing. The pleasure-interest model of aesthetic liking (PIA model) incorporates processing fluency, suggesting that fluently processed stimuli induce pleasure, while disfluency reduction induces interest. This fluency-disfluency paradigm explains preferences for simple, typical stimuli over complex, novel ones. It aligns with theories of cognitive ease and the default mode brain network associated with fast thinking. Recent research proposes a novel mathematical framework that applies free energy dynamics to the fluency-disfluency paradigm. This framework adopts variational Bayesian inference to model perceptions in the dual process and uses variations in free energy to represent emotions like <em>pleasure, interest, confusion, and boredom</em>. By focusing on Bayesian priors, this framework formalizes emotions in the second, explicit process based on changes in free energy. We apply this framework to a Gaussian Bayesian model, analyzing the effects of various parameters on emotions.</p>
<p>The pleasure-interest model of aesthetic liking models emotions in the dual process using the concept of processing fluency. The theory of fluency proposes that the experience of processing fluency of stimuli directly feels good on an affective level. Similar effects have been confirmed as the link between typicality and preferences and have been demonstrated for various visual stimuli: human faces, paintings/patterns, and artifacts/natural entities. The PIA model suggests that the processing fluency of stimuli in the first automatic process (e.g., the first impression of an aesthetic object) induces emotions such as pleasure and displeasure, while the disfluency reduction in the second controlled process (e.g., understanding a complex and novel object or resolving a conflict) induces emotions such as interest and confusion. The positive effect of disfluency reduction has also been confirmed as the aesthetic “Aha” effect, or the impact of perceptual insight, in the field of psychological aesthetics. Such a fluency–disfluency paradigm explains the discrepancy in aesthetic preferences between fluently processed objects, such as simple and typical objects, and difficult-to-process (disfluent) objects, such as complex novel objects. The idea of processing fluency in dual processes is also consistent with the theory that fast thinking is subject to “cognitive ease” and that individuals tend to think, choose, and act spontaneously according to associative principles that are easy to understand and process. It has also been shown that a network of multiple brain regions, called the default mode brain network, which includes the ventromedial prefrontal cortex and the anterior and posterior cingulate cortex, is associated with the neural basis of fast thinking available in a state of high cognitive fluency.</p>
<p>As mentioned above, the reduction in free energy during the automatic process induces positive emotions due to the decrease in uncertainty; according to the Pleasure-Interest-Affect (PIA) model, we interpret this reduction in free energy as eliciting positive emotions such as pleasure. In the PIA model, three emotional states of <strong>interest, confusion,</strong> and <strong>boredom</strong> are proposed during the transition between automatic and controlled processes. Successful reduction of disfluency in the controlled process leads to the experience of “interest.” Conversely, failure to reduce disfluency results in the perception of “confusion” as a negative emotion. When no disfluency is present, individuals may experience “boredom.” “Interest” is defined as the occurrence of both a free energy increase (disfluency) and its reduction. Since a larger increase in free energy (disfluency) permits a larger reduction, the greater the initial increase in free energy, the higher the initial level of interest. “Confusion” occurs in the case where a free energy increase (disfluency) occurs but its reduction does not. “Boredom” is defined as the case where no increase in free energy (disfluency) occurs; hence, no reduction occurs either.</p>
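<p>These three definitions reduce to a simple decision rule on the free-energy trajectory. An illustrative Python sketch follows; the threshold <code>eps</code> and the reduction of “increase” and “reduction” to two scalar summaries are our own simplifications of the model.</p>

```python
def pia_emotion(disfluency_increase, disfluency_reduction, eps=1e-9):
    # interest:  free energy rose (disfluency) and was then reduced
    # confusion: free energy rose but the reduction failed
    # boredom:   free energy never rose, so there is nothing to reduce
    if disfluency_increase <= eps:
        return "boredom"
    if disfluency_reduction > eps:
        return "interest"
    return "confusion"
```

<p>The rule also captures the claim that a larger initial rise in free energy leaves more room for reduction, and hence for interest, since the first branch only asks whether any rise occurred at all.</p>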
<p>The integration of the Free Energy Principle with emotional valence provides a promising framework for understanding the interplay between cognition and affective experiences. Moreover, the incorporation of dual-process theories and the Pleasure-Interest-Affect model has deepened our understanding of emotional processing, highlighting the importance of processing fluency in shaping affective responses. The fluency-disfluency paradigm not only explains preferences for certain stimuli but also sheds light on the cognitive mechanisms underlying aesthetic liking and decision-making.</p>
<h3 id="references">References</h3>
<ul>
<li><a href="https://journals.plos.org/ploscompbiol/article/file?id=10.1371/journal.pcbi.1003094&type=printable">Emotional Valence and the Free-Energy Principle</a></li>
<li><a href="https://www.sciencedirect.com/science/article/abs/pii/S0893608022004233">Free energy model of emotional valence in dual-process perceptions</a></li>
<li><a href="https://www.researchgate.net/publication/356396055_Free-Energy_Model_of_Emotion_Potential_Modeling_Arousal_Potential_as_Information_Content_Induced_by_Complexity_and_Novelty">Free-Energy Model of Emotion Potential: Modeling Arousal Potential as Information Content Induced by Complexity and Novelty</a></li>
<li><a href="https://www.researchgate.net/publication/361266307_Metacognitive_Feelings_A_Predictive_Processing_Perspective">Metacognitive Feelings: A Predictive-Processing Perspective</a></li>
</ul>
Sun, 30 Jun 2024 00:00:00 +0000
/posts/2024-06-30/fep_emotion.html
Tags: Free Energy Principle, Bayesian Brain, Emotional Valence, posts

<h3 id="bayesian-ordinal-and-multinomial-regression-models">Bayesian Ordinal and Multinomial Regression Models</h3>
<p>Bayesian modeling offers a powerful framework for handling ordered categorical and multinomial outcomes in a variety of contexts. When dealing with ordered categorical outcomes, such as survey responses or Likert scale ratings, Bayesian methods can be applied to fit ordinal logistic regression models. In this approach, the cumulative probabilities of each ordinal category are modeled relative to the predictor variables. By incorporating prior information about these relationships, Bayesian ordinal logistic regression provides a flexible and robust approach for modeling ordered categorical outcomes. Similarly, Bayesian modeling with Stan can also be applied to multinomial outcomes, where the outcome variable has more than two categories. Multinomial logistic regression models can be fitted using Stan, wherein the probabilities of each category relative to a reference category are modeled as a function of the predictor variables. This approach is particularly useful in settings where the outcome variable represents multiple mutually exclusive categories, such as different types of diseases or customer preferences. With Stan, researchers can specify complex multinomial logistic regression models that account for uncertainty in the parameter estimates and incorporate prior beliefs about the relationships between predictors and the outcome categories.</p>
<p>Bayesian methods allow for the incorporation of uncertainty quantification and model comparison techniques. Uncertainty quantification is essential in Bayesian modeling as it provides estimates of uncertainty in model parameters, allowing researchers to make more informed decisions and interpretations. Stan facilitates the calculation of credible intervals for model parameters, providing insights into the range of plausible values. Additionally, model comparison techniques such as the Bayesian Information Criterion (BIC) or leave-one-out cross-validation (LOO-CV) can be used to compare the fit of different models and aid in model selection. This enables researchers to identify the most appropriate model for their data, considering both goodness-of-fit and model complexity. Here we present two examples of Stan code defining multinomial and ordinal logistic regression models, in which a predictor matrix <code class="language-plaintext highlighter-rouge">x</code> is used to predict the categorical outcome variable <code class="language-plaintext highlighter-rouge">y</code>. Each model estimates coefficients <code class="language-plaintext highlighter-rouge">beta</code>, and the likelihood function relates the predictor variables to the outcome categories.</p>
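<p>As a minimal sketch of what such a LOO-CV comparison might look like with the <code class="language-plaintext highlighter-rouge">loo</code> package: <code class="language-plaintext highlighter-rouge">fit1</code> and <code class="language-plaintext highlighter-rouge">fit2</code> below are hypothetical fitted <code class="language-plaintext highlighter-rouge">stanfit</code> objects (not the models defined later in this post), and their Stan programs are assumed to store pointwise log-likelihoods in a generated quantities vector named <code class="language-plaintext highlighter-rouge">log_lik</code>:</p>

```r
# Sketch: PSIS-LOO model comparison with the loo package.
# Assumes fit1 and fit2 are stanfit objects whose Stan programs define a
# generated quantities vector log_lik of pointwise log-likelihoods.
library(loo)
log_lik_1 <- extract_log_lik(fit1, parameter_name = "log_lik")
log_lik_2 <- extract_log_lik(fit2, parameter_name = "log_lik")
loo_1 <- loo(log_lik_1)
loo_2 <- loo(log_lik_2)
loo_compare(loo_1, loo_2)  # higher elpd_loo indicates better expected predictive fit
```

<p>The <code class="language-plaintext highlighter-rouge">loo_compare</code> output reports differences in expected log predictive density along with standard errors, so the comparison itself comes with an uncertainty estimate.</p>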
<h4 id="multinomial-logistic-regression-example">Multinomial logistic regression example</h4>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">library</span><span class="p">(</span><span class="n">rstan</span><span class="p">)</span><span class="w">
</span><span class="n">stan_code</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="s1">'
data {
int K;
int N;
int D;
int y[N];
matrix[N, D] x;
}
parameters {
matrix[D, K] beta;
}
model {
matrix[N, K] x_beta = x * beta;
to_vector(beta) ~ normal(0, 5);
for (n in 1:N)
y[n] ~ categorical_logit(x_beta[n]'</span><span class="p">);</span><span class="w">
</span><span class="p">}</span><span class="err">'</span><span class="w">
</span><span class="n">stan_model</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">stan_model</span><span class="p">(</span><span class="n">model_code</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">stan_code</span><span class="p">)</span><span class="w">
</span><span class="c1"># Generate simulated data</span><span class="w">
</span><span class="n">set.seed</span><span class="p">(</span><span class="m">123</span><span class="p">)</span><span class="w">
</span><span class="n">N</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="m">100</span><span class="w">
</span><span class="n">D</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="m">3</span><span class="w">
</span><span class="n">K</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="m">5</span><span class="w">
</span><span class="n">x</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">matrix</span><span class="p">(</span><span class="n">rnorm</span><span class="p">(</span><span class="n">N</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">D</span><span class="p">),</span><span class="w"> </span><span class="n">ncol</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">D</span><span class="p">)</span><span class="w">
</span><span class="n">beta_true</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">matrix</span><span class="p">(</span><span class="n">rnorm</span><span class="p">(</span><span class="n">D</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">K</span><span class="p">),</span><span class="w"> </span><span class="n">ncol</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">K</span><span class="p">)</span><span class="w">
</span><span class="n">eta</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">x</span><span class="w"> </span><span class="o">%*%</span><span class="w"> </span><span class="n">beta_true</span><span class="w">
</span><span class="n">y</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">apply</span><span class="p">(</span><span class="n">eta</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">eta_i</span><span class="p">)</span><span class="w"> </span><span class="n">sample</span><span class="p">(</span><span class="m">1</span><span class="o">:</span><span class="n">K</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">prob</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">exp</span><span class="p">(</span><span class="n">eta_i</span><span class="p">)</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="nf">sum</span><span class="p">(</span><span class="nf">exp</span><span class="p">(</span><span class="n">eta_i</span><span class="p">))))</span><span class="w">
</span><span class="c1"># Prepare data for Stan</span><span class="w">
</span><span class="n">stan_data</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">list</span><span class="p">(</span><span class="w">
</span><span class="n">N</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">N</span><span class="p">,</span><span class="w">
</span><span class="n">D</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">D</span><span class="p">,</span><span class="w">
</span><span class="n">K</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">K</span><span class="p">,</span><span class="w">
</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">x</span><span class="p">,</span><span class="w">
</span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">y</span><span class="w">
</span><span class="p">)</span><span class="w">
</span><span class="n">fit</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">sampling</span><span class="p">(</span><span class="n">stan_model</span><span class="p">,</span><span class="w"> </span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">stan_data</span><span class="p">)</span><span class="w">
</span><span class="n">print</span><span class="p">(</span><span class="n">fit</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>
<p><strong>Data Block:</strong></p>
<div class="language-stan highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nn">data</span> <span class="p">{</span>
<span class="kt">int</span> <span class="nv">K</span><span class="p">;</span> <span class="c1">// Number of categories or classes</span>
<span class="kt">int</span> <span class="nv">N</span><span class="p">;</span> <span class="c1">// Number of observations</span>
<span class="kt">int</span> <span class="nv">D</span><span class="p">;</span> <span class="c1">// Number of predictors or features</span>
<span class="kt">int</span> <span class="nv">y</span><span class="p">[</span><span class="nv">N</span><span class="p">];</span> <span class="c1">// Outcome variable, an array of length N containing the category indices</span>
<span class="kt">matrix</span><span class="p">[</span><span class="nv">N</span><span class="p">,</span> <span class="nv">D</span><span class="p">]</span> <span class="nv">x</span><span class="p">;</span> <span class="c1">// Predictor matrix, containing the predictor values for each observation</span>
<span class="p">}</span>
</code></pre></div></div>
<p>In the data block, we declare the variables used in the model and specify their dimensions and types. Here, <code class="language-plaintext highlighter-rouge">K</code> represents the number of categories or classes, <code class="language-plaintext highlighter-rouge">N</code> is the number of observations, <code class="language-plaintext highlighter-rouge">D</code> is the number of predictors, <code class="language-plaintext highlighter-rouge">y</code> is an array of length <code class="language-plaintext highlighter-rouge">N</code> containing the category indices (each entry corresponds to the category of the respective observation), and <code class="language-plaintext highlighter-rouge">x</code> is a matrix of size <code class="language-plaintext highlighter-rouge">N</code>-by-<code class="language-plaintext highlighter-rouge">D</code> containing the predictor values for each observation.</p>
<p><strong>Parameters Block:</strong></p>
<div class="language-stan highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nn">parameters</span> <span class="p">{</span>
<span class="kt">matrix</span><span class="p">[</span><span class="nv">D</span><span class="p">,</span> <span class="nv">K</span><span class="p">]</span> <span class="nv">beta</span><span class="p">;</span> <span class="c1">// Coefficient matrix, where each column represents the coefficients for one category</span>
<span class="p">}</span>
</code></pre></div></div>
<p>In the parameters block, we declare the parameters to be estimated in the model. Here, <code class="language-plaintext highlighter-rouge">beta</code> is a matrix of size <code class="language-plaintext highlighter-rouge">D</code>-by-<code class="language-plaintext highlighter-rouge">K</code>, where each column represents the coefficients for one category. The elements of this matrix will be estimated during the modeling process.</p>
<p><strong>Model Block:</strong></p>
<div class="language-stan highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nn">model</span> <span class="p">{</span>
<span class="kt">matrix</span><span class="p">[</span><span class="nv">N</span><span class="p">,</span> <span class="nv">K</span><span class="p">]</span> <span class="nv">x_beta</span> <span class="o">=</span> <span class="nv">x</span> <span class="o">*</span> <span class="nv">beta</span><span class="p">;</span> <span class="c1">// Matrix multiplication to obtain linear predictors</span>
<span class="nb">to_vector</span><span class="p">(</span><span class="nv">beta</span><span class="p">)</span> <span class="o">~</span> <span class="nb">normal</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">5</span><span class="p">);</span> <span class="c1">// Prior distribution for the coefficients</span>
<span class="k">for</span> <span class="p">(</span><span class="nv">n</span> <span class="kr">in</span> <span class="mi">1</span><span class="o">:</span><span class="nv">N</span><span class="p">)</span>
<span class="nv">y</span><span class="p">[</span><span class="nv">n</span><span class="p">]</span> <span class="o">~</span> <span class="nf">categorical_logit</span><span class="p">(</span><span class="nv">x_beta</span><span class="p">[</span><span class="nv">n</span><span class="p">]</span><span class="o">'</span><span class="p">);</span> <span class="c1">// Likelihood function for the categorical outcome</span>
<span class="p">}</span>
</code></pre></div></div>
<p>In the model block, we define the statistical model.</p>
<ol>
<li>
<p><strong>Matrix Multiplication</strong>: We perform matrix multiplication between the predictor matrix <code class="language-plaintext highlighter-rouge">x</code> and the coefficient matrix <code class="language-plaintext highlighter-rouge">beta</code> to obtain the linear predictors for each category, stored in <code class="language-plaintext highlighter-rouge">x_beta</code>.</p>
</li>
<li>
<p><strong>Prior Distribution</strong>: We specify a prior distribution for the coefficients <code class="language-plaintext highlighter-rouge">beta</code>. Here, we assume a normal prior distribution with mean 0 and standard deviation 5 for all elements of <code class="language-plaintext highlighter-rouge">beta</code>.</p>
</li>
<li>
<p><strong>Likelihood Function</strong>: We define the likelihood function for the categorical outcome variable <code class="language-plaintext highlighter-rouge">y</code>. In this case, we use the <code class="language-plaintext highlighter-rouge">categorical_logit</code> distribution, which models the outcome as a categorical variable with probabilities proportional to the exponential of the linear predictors <code class="language-plaintext highlighter-rouge">x_beta</code>. The loop iterates over each observation <code class="language-plaintext highlighter-rouge">n</code> and assigns the corresponding likelihood of observing the category specified by <code class="language-plaintext highlighter-rouge">y[n]</code>.</p>
</li>
</ol>
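<p>To see what <code class="language-plaintext highlighter-rouge">categorical_logit</code> is doing, note that it treats each row of <code class="language-plaintext highlighter-rouge">x_beta</code> as a vector of unnormalized log-probabilities: the implied category probabilities are the softmax of that vector. A small R illustration (the <code class="language-plaintext highlighter-rouge">softmax</code> helper is written here for exposition and is not part of rstan):</p>

```r
# categorical_logit(v) assigns P(y = k) proportional to exp(v[k]);
# subtracting max(v) before exponentiating keeps the computation numerically stable.
softmax <- function(v) {
  e <- exp(v - max(v))
  e / sum(e)
}
p <- softmax(c(0.2, -1.0, 1.3))
round(p, 3)  # probabilities over K = 3 categories
sum(p)       # sums to 1
```

<p>The largest linear predictor (here the third) receives the largest probability, but every category keeps nonzero mass.</p>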
<h4 id="ordinal-logistic-regression-example">Ordinal logistic regression example</h4>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">library</span><span class="p">(</span><span class="n">rstan</span><span class="p">)</span><span class="w">
</span><span class="n">stan_code</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="s1">'
data {
int<lower=2> K;
int<lower=0> N;
int<lower=1> D;
int<lower=1,upper=K> y[N];
row_vector[D] x[N];
}
parameters {
vector[D] beta;
ordered[K-1] c;
}
model {
for (n in 1:N)
y[n] ~ ordered_logistic(x[n] * beta, c);
}'</span><span class="w">
</span><span class="n">stan_model</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">stan_model</span><span class="p">(</span><span class="n">model_code</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">stan_code</span><span class="p">)</span><span class="w">
</span><span class="n">set.seed</span><span class="p">(</span><span class="m">123</span><span class="p">)</span><span class="w">
</span><span class="n">N</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="m">100</span><span class="w">
</span><span class="n">D</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="m">2</span><span class="w">
</span><span class="n">K</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="m">4</span><span class="w">
</span><span class="n">x</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">matrix</span><span class="p">(</span><span class="n">rnorm</span><span class="p">(</span><span class="n">N</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">D</span><span class="p">),</span><span class="w"> </span><span class="n">ncol</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">D</span><span class="p">)</span><span class="w">
</span><span class="n">beta_true</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="m">-1</span><span class="p">)</span><span class="w">
</span><span class="n">c_true</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">-1</span><span class="p">,</span><span class="w"> </span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">)</span><span class="w">
</span><span class="n">eta</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">x</span><span class="w"> </span><span class="o">%*%</span><span class="w"> </span><span class="n">beta_true</span><span class="w">
</span><span class="n">y</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">apply</span><span class="p">(</span><span class="n">eta</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">eta_i</span><span class="p">)</span><span class="w"> </span><span class="nf">sum</span><span class="p">(</span><span class="n">eta_i</span><span class="w"> </span><span class="o">></span><span class="w"> </span><span class="n">c_true</span><span class="p">))</span><span class="w">
</span><span class="n">y</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">pmin</span><span class="p">(</span><span class="n">y</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">K</span><span class="p">)</span><span class="w">
</span><span class="n">stan_data</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">list</span><span class="p">(</span><span class="w">
</span><span class="n">N</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">N</span><span class="p">,</span><span class="w">
</span><span class="n">D</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">D</span><span class="p">,</span><span class="w">
</span><span class="n">K</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">K</span><span class="p">,</span><span class="w">
</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">x</span><span class="p">,</span><span class="w">
</span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">y</span><span class="w">
</span><span class="p">)</span><span class="w">
</span><span class="n">fit</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">sampling</span><span class="p">(</span><span class="n">stan_model</span><span class="p">,</span><span class="w"> </span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">stan_data</span><span class="p">)</span><span class="w">
</span><span class="n">print</span><span class="p">(</span><span class="n">fit</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>
<p>Let’s break down the code:</p>
<p><strong>Data Block:</strong></p>
<div class="language-stan highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nn">data</span> <span class="p">{</span>
<span class="kt">int</span><span class="o"><</span><span class="na">lower</span><span class="o">=</span><span class="mi">2</span><span class="o">></span> <span class="nv">K</span><span class="p">;</span> <span class="c1">// Number of categories for the ordered outcome variable</span>
<span class="kt">int</span><span class="o"><</span><span class="na">lower</span><span class="o">=</span><span class="mi">0</span><span class="o">></span> <span class="nv">N</span><span class="p">;</span> <span class="c1">// Number of observations</span>
<span class="kt">int</span><span class="o"><</span><span class="na">lower</span><span class="o">=</span><span class="mi">1</span><span class="o">></span> <span class="nv">D</span><span class="p">;</span> <span class="c1">// Number of predictors or features</span>
<span class="kt">int</span><span class="o"><</span><span class="na">lower</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span> <span class="na">upper</span><span class="o">=</span><span class="nv">K</span><span class="o">></span> <span class="nv">y</span><span class="p">[</span><span class="nv">N</span><span class="p">];</span> <span class="c1">// Array of length N containing the ordered outcome variable</span>
<span class="kt">row_vector</span><span class="p">[</span><span class="nv">D</span><span class="p">]</span> <span class="nv">x</span><span class="p">[</span><span class="nv">N</span><span class="p">];</span> <span class="c1">// Array of length N containing the predictor values for each observation</span>
<span class="p">}</span>
</code></pre></div></div>
<p>In the data block, we declare the variables used in the model.</p>
<ul>
<li><code class="language-plaintext highlighter-rouge">K</code> represents the number of categories for the ordered outcome variable.</li>
<li><code class="language-plaintext highlighter-rouge">N</code> represents the number of observations.</li>
<li><code class="language-plaintext highlighter-rouge">D</code> represents the number of predictors or features.</li>
<li><code class="language-plaintext highlighter-rouge">y</code> is an array of length <code class="language-plaintext highlighter-rouge">N</code> containing the ordered outcome variable.</li>
<li><code class="language-plaintext highlighter-rouge">x</code> is an array of length <code class="language-plaintext highlighter-rouge">N</code> containing the predictor values for each observation.</li>
</ul>
<p><strong>Parameters Block:</strong></p>
<div class="language-stan highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nn">parameters</span> <span class="p">{</span>
<span class="kt">vector</span><span class="p">[</span><span class="nv">D</span><span class="p">]</span> <span class="nv">beta</span><span class="p">;</span> <span class="c1">// Coefficients for the predictor variables</span>
<span class="kt">ordered</span><span class="p">[</span><span class="nv">K</span><span class="o">-</span><span class="mi">1</span><span class="p">]</span> <span class="nv">c</span><span class="p">;</span> <span class="c1">// Cutpoints separating the categories</span>
<span class="p">}</span>
</code></pre></div></div>
<p>In the parameters block, we declare the parameters to be estimated in the model.</p>
<ul>
<li><code class="language-plaintext highlighter-rouge">beta</code> is a vector of length <code class="language-plaintext highlighter-rouge">D</code>, representing the coefficients for the predictor variables.</li>
<li><code class="language-plaintext highlighter-rouge">c</code> is an ordered array of length <code class="language-plaintext highlighter-rouge">K-1</code>, representing the cutpoints that separate the categories of the ordered outcome variable.</li>
</ul>
<p><strong>Model Block:</strong></p>
<div class="language-stan highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nn">model</span> <span class="p">{</span>
<span class="k">for</span> <span class="p">(</span><span class="nv">n</span> <span class="kr">in</span> <span class="mi">1</span><span class="o">:</span><span class="nv">N</span><span class="p">)</span>
<span class="nv">y</span><span class="p">[</span><span class="nv">n</span><span class="p">]</span> <span class="o">~</span> <span class="nb">ordered_logistic</span><span class="p">(</span><span class="nv">x</span><span class="p">[</span><span class="nv">n</span><span class="p">]</span> <span class="o">*</span> <span class="nv">beta</span><span class="p">,</span> <span class="nv">c</span><span class="p">);</span> <span class="c1">// Likelihood function</span>
<span class="p">}</span>
</code></pre></div></div>
<p>In the model block, we define the statistical model.</p>
<ol>
<li><strong>Likelihood Function</strong>: We specify the likelihood function for the ordered outcome variable <code class="language-plaintext highlighter-rouge">y</code>. The <code class="language-plaintext highlighter-rouge">ordered_logistic</code> distribution models the outcome as an ordered categorical variable with ordered cutpoints specified by <code class="language-plaintext highlighter-rouge">c</code>. For each observation <code class="language-plaintext highlighter-rouge">n</code>, we model the probability of observing the category specified by <code class="language-plaintext highlighter-rouge">y[n]</code> given the predictor values <code class="language-plaintext highlighter-rouge">x[n]</code> and the coefficients <code class="language-plaintext highlighter-rouge">beta</code>.</li>
</ol>
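<p>Concretely, under Stan's <code class="language-plaintext highlighter-rouge">ordered_logistic</code> parameterization the probability that <code class="language-plaintext highlighter-rouge">y</code> falls at or below category <code class="language-plaintext highlighter-rouge">k</code> is the logistic function applied to <code class="language-plaintext highlighter-rouge">c[k] - eta</code>, and each category's probability is the gap between adjacent cumulative probabilities. The helper below is written for this post (it is not a Stan or rstan function), using the cutpoints <code class="language-plaintext highlighter-rouge">c(-1, 0, 1)</code> from the simulation above:</p>

```r
# Category probabilities implied by ordered_logistic(eta, cutpoints):
# cumulative P(y <= k) = plogis(cutpoints[k] - eta); category probabilities
# are differences of adjacent cumulative probabilities.
ordered_logistic_probs <- function(eta, cutpoints) {
  cum <- plogis(cutpoints - eta)  # cumulative probabilities at the K - 1 cutpoints
  diff(c(0, cum, 1))              # K category probabilities
}
p <- ordered_logistic_probs(eta = 0.5, cutpoints = c(-1, 0, 1))
round(p, 3)  # K = 4 category probabilities
sum(p)       # sums to 1
```

<p>Raising <code class="language-plaintext highlighter-rouge">eta</code> shifts probability mass toward the higher categories, which is why a single coefficient vector <code class="language-plaintext highlighter-rouge">beta</code> suffices for all <code class="language-plaintext highlighter-rouge">K</code> ordered categories.</p>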
<h4 id="a-premiere-on-dirichlet-distribution">A primer on the Dirichlet distribution</h4>
<p>The Dirichlet distribution is a family of continuous multivariate probability distributions, parameterized by a vector <code class="language-plaintext highlighter-rouge">α</code> of positive real numbers. It is commonly used as a prior distribution for categorical or multinomial variables in Bayesian statistics. The Dirichlet distribution can characterize the random variability of a multinomial distribution and is particularly useful for modeling actual measurements due to its ability to generate a wide variety of shapes based on the parameters <code class="language-plaintext highlighter-rouge">α</code>. The Dirichlet distribution is useful for modeling categorical data across applications such as multinomial models; Stan provides a categorical family specifically designed for multinomial or categorical outcomes. This enables the fitting of Bayesian models with multinomial responses and facilitates the generation of contrasts across all categories (rather than comparisons against a single reference category). By utilizing the Dirichlet distribution as a prior distribution for categorical or multinomial variables in Bayesian regression, researchers can introduce prior knowledge or beliefs about the distribution of categorical data into their modeling process.</p>
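<p>For intuition, Dirichlet samples can be generated in base R by normalizing independent Gamma draws, a standard construction; the <code class="language-plaintext highlighter-rouge">rdirichlet</code> helper below is written for this post (packages such as MCMCpack provide an equivalent function):</p>

```r
# Draw n samples from Dirichlet(alpha) by normalizing Gamma(alpha[k], 1) draws:
# each row lies on the probability simplex, i.e. entries are positive and sum to 1.
rdirichlet <- function(n, alpha) {
  k <- length(alpha)
  # byrow = TRUE aligns the recycled shape parameters alpha[1..k] with the columns
  g <- matrix(rgamma(n * k, shape = alpha, rate = 1), ncol = k, byrow = TRUE)
  g / rowSums(g)
}
set.seed(123)
draws <- rdirichlet(5, alpha = c(2, 2, 2))
rowSums(draws)  # each row sums to 1
```

<p>Larger values of <code class="language-plaintext highlighter-rouge">α</code> concentrate the draws around the mean proportions <code class="language-plaintext highlighter-rouge">alpha / sum(alpha)</code>, while values below 1 push mass toward the corners of the simplex, which is what makes the family flexible as a prior over category probabilities.</p>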
<h3 id="references">References</h3>
<ul>
<li><a href="https://mc-stan.org/docs/2_20/stan-users-guide/multi-logit-section.html">Multi-Logit Regression</a></li>
<li><a href="https://vincentarelbundock.github.io/rethinking2/12.html">Statistical Rethinking 2: Chapter 12</a></li>
<li><a href="https://mc-stan.org/docs/2_18/stan-users-guide/ordered-logistic-section.html">Ordered Logistic and Probit Regression</a></li>
<li><a href="https://builtin.com/data-science/dirichlet-distribution">The Dirichlet Distribution: What Is It and Why Is It Useful?</a></li>
<li><a href="https://distribution-explorer.github.io/multivariate_continuous/dirichlet.html">https://distribution-explorer.github.io/multivariate_continuous/dirichlet.html</a></li>
<li><a href="https://www.statisticshowto.com/dirichlet-distribution/">Dirichlet Distribution: Simple Definition, PDF, Mean</a></li>
<li><a href="https://www.andrewheiss.com/blog/2023/09/18/understanding-dirichlet-beta-intuition/">Guide to understanding the intuition behind the Dirichlet distribution</a></li>
</ul>
Sat, 25 May 2024 00:00:00 +0000
/posts/2024-05-25/stan_multinomial.html
STAN, R, Bayes, posts

Free Energy Principle and Bayesian Brain

<h3 id="bayesian-brain-hypothesis">Bayesian Brain Hypothesis</h3>
<p>As mentioned in an earlier post, the free energy principle, proposed by Karl Friston, suggests that the brain operates as a prediction machine, constantly minimizing surprise or uncertainty by making and updating predictions based on internal models. This principle integrates Bayesian inference with active inference, where actions are guided by predictions and sensory feedback refines them. The Bayesian brain hypothesis, for its part, posits that the brain is equipped with an internal (or “generative”) model of the environment, which specifies a recipe for generating sensory observations from hidden states. It is noteworthy that this internal model may not be represented explicitly anywhere in the brain; rather, the brain computes “as if” it had such a model.</p>
<p>The Bayesian brain hypothesis employs Bayesian probability theory to conceptualize perception as a constructive process based on internal or generative models. The fundamental idea is that the brain possesses a model of the world, which it seeks to optimize using sensory inputs. This perspective views the brain as an inference machine actively predicting and explaining sensations. Central to this hypothesis is a probabilistic model that generates predictions, against which sensory samples are tested to update beliefs about their causes. The generative model consists of a likelihood (probability of sensory data given their causes) and a prior (a priori probability of those causes). Perception, in this framework, involves inverting the likelihood model to access the posterior probability of causes given sensory data, a process equivalent to minimizing the difference between recognition and posterior densities to suppress free energy. More specifically, the free energy principle subsumes the Bayesian brain hypothesis, and the two are closely aligned in their emphasis on the brain as a prediction machine that seeks to minimize surprise or prediction error. The free energy principle provides a theoretical foundation for understanding the mechanisms underlying the Bayesian brain hypothesis, and both frameworks are concerned with how the brain makes and updates predictions to minimize surprise and uncertainty.</p>
<p>Hierarchical generative models address the problem of optimizing priors across multiple levels of a hierarchy. In such models, causes at one level generate subordinate causes at the level below, with sensory data generated at the lowest level; the optimization of empirical priors is thereby informed by sensory data, yielding an internally consistent representation of sensory causes across levels. In short, the Bayesian brain hypothesis suggests that the brain is an inference engine striving to optimize probabilistic representations of the causes of sensory input. The optimization is facilitated by the variational free-energy principle, which is implemented through various schemes involving message passing or belief propagation among brain areas or units. This connection allows the integration of the free-energy principle with information theory, providing a comprehensive understanding of sensory processing.</p>
<h3 id="adaptive-priors-and-the-bayesian-brain">Adaptive Priors and the Bayesian Brain</h3>
<p>Biological systems indirectly reduce surprise by minimizing free energy, using sensations and predictions based on the hierarchical generative model encoded in internal states. The FEP generalizes the theory of predictive coding, suggesting that living beings can minimize surprise through changes in predictions by altering internal states (perception and learning) or by changing their relation with the environment (action). Action and perception operate reciprocally to maintain homeostasis and optimize an organism’s generative model of the world. Minimizing free energy means inducing an upper bound on surprise through predictions and optimizing brain activity and connectivity, involving action, perception, and learning. This process, mathematically equivalent to maximizing Bayesian model evidence, compels individuals to make Bayesian inferences about their environment. The optimization of world models occurs through evolution, neurodevelopment, and learning with prior beliefs being central in shaping predictions, behavior, and the hierarchical structure of the brain.</p>
<p>A central question about adaptive priors in the Bayesian brain hypothesis is whether they are innate or learned. The hypothesis holds that the brain is equipped with an internal (or “generative”) model of the environment, which specifies a “recipe” for generating sensory observations from hidden states. These priors, whether innate or learned, provide the brain with adapted prior guesses when processing sensory information, enabling it to make predictions and inferences about the environment. The origin of these priors, innate or acquired through learning, remains an ongoing topic of research and discussion within this framework, and their adaptive nature is essential for the brain to effectively infer sensory contingencies and exhibit adaptive behavior. The Bayesian brain hypothesis emphasizes that the brain constantly updates and refines its internal beliefs to align with the current state of the environment; this process of adaptive inference enables the brain to minimize surprise and uncertainty, ultimately guiding behavior and decision-making in a changing and uncertain world.</p>
<p>Individuals are adapted, or optimized, to their environment, whether through evolution or daily learning, resulting in expectations encoded by neuronal form and activity. While individual expectations may differ, some aspect of these expectations must be inherited to conserve the physical form across generations. Bayesian prior beliefs play a crucial role in this context, representing expectations about the sensory experiences anticipated in the world. The Free Energy Principle (FEP) proposes that species-typical patterns of cognition and behavior can be explained through adaptive priors—inherited expectations about the causal structure of the world, shaped by evolution and life’s characteristic properties. The FEP posits that living systems must minimize variational free energy to reduce the entropy of their sensory and physiological states, ensuring their survival. Variational free energy is an information-theoretic quantity limiting the entropy of a generative model entailed by the state of a biological system. Living systems, per the FEP, actively avoid surprising phase transitions by minimizing the entropy of their sensory and physical states, exhibiting local ergodicity. This propensity to minimize surprise is a consequence of natural selection, which favors systems capable of avoiding such transitions; the ability to repeatedly return to a limited set of unsurprising states delays the deleterious effects of dissipative processes. Adaptive priors, whether innate or learned, help the brain make predictions by providing prior expectations that are continuously updated from sensory input and past experience. They guide the brain’s action-perception cycles toward adaptive and unsurprising states, contributing to its ability to minimize surprise and uncertainty, and thus serve as a foundational component that enables the brain to make accurate predictions and navigate a changing and uncertain world.</p>
<h3 id="implications-for-theorizing-and-research-in-psychology">Implications for Theorizing and Research in Psychology</h3>
<p>There is increasing support for the Bayesian brain and the FEP in neuroscience, based on computer simulations, visual system studies, and brain microcircuit analyses, as well as computational dynamic causal models of fMRI and EEG data that explain neural responses to unpredictable stimuli and complex phenomena like insight and curiosity. While progress has been made in neuroscience, psychologists have been slow to exploit the explanatory power of the FEP. However, some have explored its relevance to various psychological phenomena, such as anxiety, emotion, illusions, delusions, hallucinations, and consciousness. The FEP aligns with ecological psychology principles, emphasizing the reciprocal relationship between organisms and their environment, particularly the concept of affordance.</p>
<p>The FEP also aligns with representationalism, as free-energy is defined in relation to a Bayesian belief about hidden causes in the environment. At the social psychology level, the FEP has inspired models explaining mentalizing and self- and other-representations. Predictive coding is proposed to explain mentalizing, with estimations used to predict others’ behaviors. The FEP has been leveraged to explain self- and other-representations as heuristics reducing uncertainty in social interactions. Beliefs about likely social outcomes are weighted by their precision and updated with experience. Extending beyond social cognition, the FEP has been applied to interpersonal behaviors like dyadic conversation. Communication is seen as a means to resolve uncertainty by adopting a shared narrative, enabling actors to predict each other’s sensations and minimize mutual prediction errors. Finally, the FEP has been applied to large-scale sociocultural phenomena, explaining how shared expectations in social groups become encoded neuronally as high-level priors, guiding cooperative action and reducing uncertainty at individual and group levels.</p>
<p>Finally, active inference, a framework derived from the FEP, represents a novel approach to understanding how organisms learn and make decisions in uncertain environments. Unlike classical approaches to reinforcement learning, which often face issues related to circularity, active inference offers a fresh perspective by formulating utility based on adaptive and empirical priors. These priors, acquired over various timescales and nested levels, provide a mechanism for organisms to navigate complex and dynamic environments effectively. In traditional reinforcement learning, actions are typically selected to maximize the expected utility or reward associated with different outcomes. However, this approach struggles to explain how preferences or values emerge and evolve over time. Active inference addresses this issue by grounding utility not only in immediate rewards but also in the organism’s prior beliefs about the environment. These beliefs are continually updated based on sensory input and previous experience, allowing the organism to adapt its behavior to changing circumstances. By incorporating the concept of priors, active inference offers a more comprehensive account of decision-making: organisms do not just respond to immediate rewards but also consider the uncertainty associated with different actions and outcomes. This perspective aligns with the FEP’s emphasis on minimizing surprise or uncertainty in the organism’s internal model of the world. Consequently, active inference provides a theoretical framework that not only addresses the shortcomings of classical reinforcement learning but also offers insights into the underlying mechanisms of adaptive behavior in biological systems.</p>
<h3 id="refrences">References</h3>
<ul>
<li><a href="https://gershmanlab.com/pubs/free_energy.pdf">What does the free energy principle tell us about the brain?</a></li>
<li><a href="https://www.uab.edu/medicine/cinl/images/KFriston_FreeEnergy_BrainTheory.pdf">The free-energy principle: a unified brain theory </a></li>
<li><a href="https://www.fil.ion.ucl.ac.uk/~karl/The%20free-energy%20principle%20-%20a%20rough%20guide%20to%20the%20brain.pdf">The free-energy principle: a rough guide to the brain?</a></li>
<li><a href="https://www.fil.ion.ucl.ac.uk/~karl/A%20free%20energy%20principle%20for%20the%20brain.pdf">A free energy principle for the brain</a></li>
<li><a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5167251/">Active inference and learning</a></li>
<li><a href="https://link.springer.com/article/10.1007/s00422-019-00805-w">Generalised free energy and active inference</a></li>
<li><a href="https://direct.mit.edu/books/oa-monograph/5299/Active-InferenceThe-Free-Energy-Principle-in-Mind">Active Inference: The Free Energy Principle in Mind, Brain, and Behavior</a></li>
<li><a href="https://link.springer.com/article/10.1007/s11229-023-04292-2">Incorporating (variational) free energy models into mechanisms</a></li>
<li><a href="https://www.frontiersin.org/articles/10.3389/fnbot.2022.844773">The Problem of Meaning: The Free Energy Principle and Artificial Agency</a></li>
<li><a href="https://www.frontiersin.org/articles/10.3389/fpsyg.2019.00592">“Surprise” and the Bayesian Brain: Implications for Psychotherapy Theory and Practice</a></li>
<li><a href="https://link.springer.com/article/10.1007/s10539-022-09864-z">Free energy: a user’s guide</a></li>
</ul>
Sun, 28 Apr 2024 00:00:00 +0000
/posts/2024-04-28/FEP_BB.html
Free Energy Principle · Bayesian Brain · Adaptive Priors · posts
Synthetic Data in Health Care<p><strong>Why do we need synthetic data in healthcare?</strong></p>
<p>The use of health data for innovation and care enhancement has been a longstanding aim within the sector. In recent years, the rise of artificial intelligence (AI) and machine learning (ML) has opened up promising avenues for leveraging health data to provide decision support for clinicians, develop more effective treatments, and enhance overall system efficiency. However, despite the potential of these technologies, their widespread adoption faces significant challenges. One of the primary issues is the accessibility of data, with ML applications relying heavily on large, high-quality datasets for training and validation. In the healthcare domain, where privacy regulations are tight and data sharing is often restricted, accessing such datasets can be particularly challenging. Privacy protection is another critical concern in the healthcare sector, governed by a set of regulations and ethical considerations. These regulations aim to safeguard the security and confidentiality of patient records by controlling the flow of information and preventing unauthorized access. Violations of these rules, often termed “invasions of privacy” or “privacy breaches,” can have serious consequences, including the unauthorized disclosure of personal health information. Privacy threats in healthcare data can take various forms, including identity disclosure, attribute disclosure, and membership disclosure, and adversaries employ a range of techniques to compromise patient privacy, posing significant challenges to health data sharing and access. Privacy concerns have a profound impact on the sharing and accessibility of health data. One consequence is the phenomenon known as the <em>“privacy chill,”</em> where reluctance or refusal to share health data due to privacy concerns leads to a slowdown or complete restriction on data sharing initiatives. This phenomenon has been shown to harm various aspects of healthcare, including the response to health crises such as the COVID-19 pandemic and the recruitment and retention of talented health data scientists. The privacy chill underscores the delicate balance between protecting patient privacy and facilitating data access for research and innovation in healthcare.</p>
<p><strong>What is Synthetic Data?</strong></p>
<p>To address the challenges faced by privacy concerns and data accessibility, researchers have explored the use of synthetic data in healthcare. Synthetic data refers to artificially generated data that replicates the statistical properties of real data while protecting privacy and confidentiality. The U.S. Census Bureau defines synthetic data as <em>new data values generated using statistical models</em>. Synthetic data can be broadly categorized into three types: fully synthetic, partially synthetic, and hybrid. Fully synthetic data involves creating entirely fabricated data without any real data, offering strong privacy control but limited analytic value due to the absence of real-world patterns. Partially synthetic data replaces sensitive variables with synthetic versions while retaining original values, striking a balance between privacy and utility. Hybrid synthetic data combines elements of both real and synthetic data, offering enhanced utility while still providing privacy protection. The categorization of synthetic data types aids in understanding the trade-offs between privacy and utility across different levels of replication and augmentation. For example, the Office for National Statistics (ONS) in the UK has delineated a detailed spectrum of synthetic data types, ranging from purely synthetic structural datasets with minimal analytic value and no disclosure risk to replica-level synthetically augmented datasets that closely mirror real data at the cost of higher disclosure risks. This spectrum offers insights into the varying utility and privacy aspects of synthetic data and provides researchers with options tailored to their specific needs and constraints.</p>
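As a toy illustration of the fully vs. partially synthetic distinction described above, the sketch below fits a normal distribution to a real numeric column and draws new values from it. The function names and the normal-distribution assumption are illustrative only; real generators use far richer models (copulas, GANs, variational autoencoders).

```python
import random
import statistics

def fully_synthetic_column(column, n, seed=0):
    """Draw n entirely new values from a normal distribution fitted to a
    real numeric column (fully synthetic: no original values survive)."""
    rng = random.Random(seed)
    mu, sd = statistics.mean(column), statistics.stdev(column)
    return [rng.gauss(mu, sd) for _ in range(n)]

def partially_synthetic_record(record, sensitive_keys, column_stats, seed=0):
    """Replace only the sensitive fields of a record with synthetic draws,
    keeping the remaining original values (partially synthetic)."""
    rng = random.Random(seed)
    out = dict(record)
    for key in sensitive_keys:
        mu, sd = column_stats[key]       # per-column (mean, stdev) fitted on real data
        out[key] = rng.gauss(mu, sd)
    return out
```

A hybrid dataset, in this toy view, would simply mix rows produced by the two routes; the spectrum described by the ONS corresponds to how much real structure the fitted model is allowed to carry.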
<p><strong>Utilizations of Synthetic Health Data</strong></p>
<p>Synthetic data generation presents a range of opportunities for addressing key challenges in healthcare data. One of its primary advantages is the ability to safeguard individual privacy and record confidentiality by generating data that is difficult to re-identify. By blending “fake” and original data, synthetic datasets strike a balance between utility and privacy, making them valuable for a wide range of applications. For researchers and data users, synthetic data enhances accessibility to health data by providing datasets with minimal disclosure risk, opening doors for a broader array of users and accelerating research and innovation in healthcare. Furthermore, synthetic data addresses the scarcity of realistic data for software development and testing: synthetic datasets offer cost-effective and authentic test data for software applications, streamlining the development process and ensuring that applications perform as expected in real-world scenarios. As researchers continue to explore its applications, the benefits of synthetic data become increasingly apparent across healthcare domains. In particular, it significantly streamlines the training, testing, and deployment of AI solutions, promoting more efficient and effective development.</p>
<p><strong>Challenges in Synthetic Health Data</strong></p>
<p>Despite the promise of synthetic data, its widespread adoption in healthcare faces several challenges. The first is evaluating the quality of synthetic data. Quality evaluation involves assessing <em>fidelity, diversity, and generalizability</em>, among other factors. Fidelity refers to the extent to which synthetic samples resemble real samples, diversity ensures adequate coverage of the real data population, and generalizability assesses the ability of synthetic data to accurately reflect real-world phenomena across different contexts. Evaluating synthetic data quality requires robust methodologies and validation frameworks to ensure that synthetic datasets meet the necessary criteria for use in healthcare applications. <em>Privacy implementation</em> is a second critical challenge. Privacy concerns arise from the potential for breaches and unauthorized access to sensitive information; while synthetic data aims to protect privacy by generating data that is difficult to re-identify, there is a constant trade-off between privacy and utility, and achieving the right balance is crucial for meeting the needs of researchers and data users while safeguarding patients. <em>Mitigating bias amplification</em> is a third significant challenge. Synthetic data inherits biases present in the real data on which it is based, and these biases risk being amplified during generation. Addressing them requires validation frameworks and fairness-aware synthesis methods that minimize bias and ensure equitable representation of different subgroups within the synthetic data, promoting fairness and transparency in healthcare applications.</p>
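Fidelity checks of the kind described above are often operationalized by comparing the marginal distributions of real and synthetic columns. The sketch below (an illustrative choice, not a standard prescribed by the cited literature) computes the two-sample Kolmogorov–Smirnov statistic, i.e. the largest gap between the two empirical CDFs; values near 0 suggest close marginal fidelity, values near 1 suggest the synthetic column misses the real distribution entirely.

```python
def ks_statistic(real, synthetic):
    """Two-sample Kolmogorov-Smirnov statistic: the maximum absolute gap
    between the empirical CDFs of the two samples (0 = identical, 1 = disjoint)."""
    a, b = sorted(real), sorted(synthetic)
    i = j = 0
    d = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        # Step past all copies of the current value in both samples,
        # then compare the empirical CDFs just after x.
        while i < len(a) and a[i] == x:
            i += 1
        while j < len(b) and b[j] == x:
            j += 1
        d = max(d, abs(i / len(a) - j / len(b)))
    return d
```

In practice such a test would be applied per column and complemented by multivariate checks and downstream-task evaluation (train on synthetic, test on real).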
<p><strong>What are the future directions?</strong></p>
<p>Despite these challenges, synthetic data generation holds great promise for transforming healthcare data infrastructure and research. With a vision of bridging the accountability gap through privacy legislation and regulations that balance innovation and privacy in healthcare, it is essential, moving forward, to invest in research and development to advance synthetic data techniques and validation frameworks. By addressing these challenges and maximizing the potential of synthetic data, researchers can leverage health data to improve patient care, advance medical research, and drive innovation in healthcare delivery.</p>
<h3 id="references">References</h3>
<ul>
<li><a href="https://doi.org/10.1016/j.neucom.2022.04.053">Synthetic data generation for tabular health records: A systematic review. Neurocomputing, 493, 28-45. Advance online publication.</a></li>
<li><a href="https://doi.org/10.1371/journal.pdig.0000082">Synthetic data in health care: A narrative review</a></li>
<li><a href="https://arxiv.org/abs/2302.04062">Machine Learning for Synthetic Data Generation</a></li>
<li><a href="https://www.sciencedirect.com/science/article/abs/pii/S1574013723000138">Synthetic data generation: State of the art in health care domain</a></li>
<li><a href="https://www.cell.com/iscience/fulltext/S2589-0042(22)01603-0?_returnURL=https%3A%2F%2Flinkinghub.elsevier.com%2Fretrieve%2Fpii%2FS2589004222016030%3Fshowall%3Dtrue">Synthetic data as an enabler for machine learning applications in medicine</a></li>
</ul>
Tue, 26 Mar 2024 00:00:00 +0000
/posts/2024-03-26/synth_blog.html
Privacy · Synthetic Data · DigitalHealth · posts
Free Energy Principle<h3 id="an-introduction-to-free-energy-principle">An Introduction to Free Energy Principle</h3>
<p>The free energy principle and active inference theory, developed by Karl Friston, provide a theoretical framework for understanding perception, action, and learning in biological and artificial systems. This brief post introduces both concepts with a focus on accessibility, aiming to give readers a foundational understanding regardless of their background in neuroscience or artificial intelligence. Join us as we explore these fascinating ideas and their implications for understanding the complexities of cognitive processes.</p>
<h3 id="free-energy-principle">Free Energy Principle</h3>
<p>The Free Energy Principle (FEP) is a mathematical principle of information physics, suggesting that the brain reduces surprise or uncertainty by making predictions based on internal models. According to this theory, the brain, as a physical system, minimizes a quantity known as surprise, or its variational upper bound, called free energy. The free energy principle is a “first principles” approach to understanding behavior and the brain, framed in terms of a single imperative: minimize free energy. The brain does so by making predictions based on internal models and updating them using sensory input. The principle integrates Bayesian inference with active inference, whereby actions are guided by predictions and sensory feedback refines them. In this framework, perception is the minimization of free energy with respect to sensory input, and action is the minimization of the same free energy with respect to outbound action information. The free energy principle thus provides a theoretical basis for understanding how the brain’s internal models are used to make predictions and minimize surprise, which is essential for adaptive behavior and learning.</p>
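In standard notation (a common textbook formulation rather than one drawn verbatim from the references below), variational free energy $F$ bounds surprise, the negative log evidence, from above:

```latex
F = \mathbb{E}_{q(s)}\!\left[\ln q(s) - \ln p(o, s)\right]
  = -\ln p(o) + D_{\mathrm{KL}}\!\left[\,q(s)\,\big\|\,p(s \mid o)\,\right]
  \;\geq\; -\ln p(o),
```

where $o$ denotes sensory observations, $s$ hidden states, $p$ the generative model, and $q$ the approximate posterior. Because the KL divergence is non-negative, minimizing $F$ with respect to $q$ tightens the bound on surprise (perception and learning), while acting to change $o$ itself minimizes the same quantity through action.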
<p>The concept of “minimizing free energy” revolves around reducing the error or surprise generated by interactions with the external environment, including one’s own body. Free energy serves as a measure of surprise, equating to the negative logarithm of the evidence for the system’s model. The system consistently strives to minimize surprise, aiming to decrease uncertainty in its sensory exchanges with the world. Although no model can perfectly represent the external world, some level of uncertainty is necessary for system optimization. Excessive uncertainty or error contradicts the system’s goal of adhering to its attractor states, defining what it is. The free energy principle dictates that surprise must be actively constrained to ensure the maximization of model evidence. Minimizing free energy corresponds to reducing the error in the system’s predictions about the world, ultimately enhancing the precision of the system’s capacity to model its environment—a concept succinctly termed self-evidencing. Models of perception grounded in prediction, stemming from Helmholtz’s notion of unconscious inference, have evolved over time. These models focus on top-down inferences about one’s environment. State-based predictive models fall under the umbrella of “predictive processing,” where the brain is likened to a scientist making observations, collecting data, and generating hypotheses based on available information.</p>
<p>Predictive processing redefines the brain as a system driven by top-down and bottom-up neural networks, aiming to signal predictions and minimize prediction error. Reentrant loops, providing feedback from higher brain regions to sensory areas, propagate predictions about sensory input. If the prediction sufficiently accounts for the signal, the error is explained away; otherwise, the predictive model is revised. This process can be described in terms of Bayesian belief updating, involving the reciprocal exchange of top-down predictions and bottom-up prediction errors. Bidirectional signals throughout the cortical hierarchy build generative models about the sensed world.</p>
<h3 id="predictive-processing-pp-and-the-free-energy-principle">Predictive Processing (PP) and the Free Energy Principle</h3>
<p>Although the free energy principle primarily concerns system dynamics, it can be applied to biological phenomena at various scales, from the microscopic to brain function and psychological phenomena. Predictive processing, a theory of brain function, suggests that the brain constantly generates and updates internal models to make predictions about the world. These predictions are compared with sensory input, and any disparities result in prediction errors, which drive learning and adaptation. The free energy principle’s assumption that the brain minimizes surprise or uncertainty by making and updating predictions based on internal models thus connects directly to predictive processing, and the principle can be seen as providing a theoretical foundation for the mechanisms predictive processing proposes. It has been argued that the free energy principle imposes an important constraint, related to the minimization of long-term average prediction error in the context of predictive processing. In later posts we will discuss the relationship between this constraint and phenomena like consciousness and psychopathology.</p>
<p>The free energy principle and predictive processing share a common emphasis on the brain as a prediction machine, constantly seeking to minimize surprise or prediction error. This shared emphasis has led to efforts to integrate the two frameworks, with the free energy principle providing a unifying perspective that can encompass and explain the mechanisms proposed by predictive processing. A useful example at the level of human psychology is belief formation. Under the free energy principle, “beliefs” align with probability distributions over external states, parameterized by internal representations. Belief formation, a process where the brain learns about the world based on prior observations, is synonymous with predictive processing, which describes belief formation through updating and developing priors. The free energy principle and predictive processing are interconnected in their descriptions of belief formation concerning learning and perceptual inference. The key distinction lies in the free energy principle providing a foundational method that seeks to dissolve disciplinary boundaries. Together, these approaches not only clarify optimal prediction and model generation but also highlight how contextual cues influence the probabilities of specific states or outcomes. Importantly, they allow exploration of outcomes resulting from decisions, choices, and actions.</p>
<p>Predictive processing aims to provide both causal and constitutive explanations of cognitive capacities, in alignment with the mechanistic approach to explanation. Currently, cognitivist PP offers descriptive and functional analyses, yet it fails to mechanistically explain all components of cognition. Free energy theory extends cognition beyond the organism and blurs the boundaries between cognitive and non-cognitive phenomena. Explanations based on free energy enactivism focus on describing free energy minimization without considering the structures and mechanisms involved. A mechanistic approach, which identifies the relevant components and respects both functional and structural properties, offers a more comprehensive explanation, enabling an understanding of how cognitive capacities are realized in different biological systems by identifying the structures and processes underlying prediction error minimization. As cognitive, embodied agents, our directedness toward the world is what distinguishes us from non-living systems. Cognitive mechanisms extend beyond the neural domain, implicating the entire system comprising the nervous system, body, and relevant environmental aspects. The body plays three constitutive roles in cognition: regulating cognitive activity to link cognition and action, distributing cognitive load, and constraining information processing by serving as a model of the environment. These roles highlight the intricate relationship between body and cognition, emphasizing the body’s indispensable role in cognitive processing.</p>
<h3 id="active-inference-framework">Active Inference Framework</h3>
<p>Cognitive systems do not passively observe the world; rather, cognitive agents actively engage with and sample the environment to test their predictions about the causes of sensory data. The concept of active inference, derived from the free energy principle, describes how agents minimize variational free energy by testing and updating generative models through sequences of actions predicted to yield preferred outcomes, known as action policies. In other words, active inference is a way of understanding sentient behavior, emphasizing the implications of the free energy principle for the intricate relationship between body/action and mind/cognition. Active inference is based on the premise that an agent’s update rules for action, perception, policy selection, learning, and the encoding of uncertainty all aim at minimizing variational free energy. More specifically, it characterizes perception as the minimization of free energy with respect to sensory input, and action as the minimization of the same free energy with respect to outbound action information. Likewise, this approach characterizes planning and action in terms of probabilistic inference, emphasizing the brain’s constant effort to minimize surprise or uncertainty by updating internal models based on sensory input.</p>
<p>Active inference presupposes that agents have preferences for particular states that minimize uncertainty or expected surprise, since surprising states are inherently aversive (these preferences are priors in Bayesian terms). Action policies, and the subsequent adjustment of generative models, are directed toward achieving preferred sensory outcomes and avoiding non-preferred ones. Along these lines, agentic preferences over sensory outcomes are typically treated as prior predictions, referred to as prior preferences; if the actual sensory outcome deviates from the preferred outcomes, it is considered surprising. When deciding among potential action sequences, agents weigh the expected surprise generated by different courses of action and can then infer the most likely action, a process sometimes described as planning or control as inference. Reaching preferred outcomes through action policies involves minimizing the expected divergence between preferred sensory outcomes and those anticipated when committing to a specific plan. The key point is that actions are selected based on the agent’s estimate of how likely they are to generate preferred sensory outcomes, often in line with the agent’s existing world model.</p>
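Under one common formulation (schematic here, rather than quoted from the references below), the expected free energy $G$ of a policy $\pi$ decomposes into a risk term and an ambiguity term:

```latex
G(\pi) =
\underbrace{D_{\mathrm{KL}}\!\left[\,Q(o \mid \pi)\,\big\|\,P(o)\,\right]}_{\text{risk: divergence from preferred outcomes}}
\;+\;
\underbrace{\mathbb{E}_{Q(s \mid \pi)}\!\left[\mathrm{H}\!\left[P(o \mid s)\right]\right]}_{\text{ambiguity: expected outcome uncertainty}},
```

where $P(o)$ encodes the prior preferences described above and $Q(\cdot \mid \pi)$ is the distribution over outcomes and states anticipated under policy $\pi$. Policies are then selected with probability proportional to $e^{-G(\pi)}$, so agents favor plans that both approach preferred outcomes and resolve uncertainty.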
<p>Empirical studies within the active inference framework have contributed significantly to our understanding of the brain, providing insights into the neural mechanisms underlying perception, learning, and decision-making and offering a unified theory at both computational and neural levels of description. For instance, <a href="https://www.sciencedirect.com/science/article/pii/S0306987714004423">Schwartenbeck et al. (2015</a>) demonstrated how these mechanisms can account for brain activity in the case of addiction. Additionally, <a href="https://www.frontiersin.org/articles/10.3389/fnsys.2021.772641/full">Parr et al. (2021)</a> and Ueltzhöffer (2018) have shown that active inference offers a principled treatment of epistemic exploration as a means of uncertainty reduction, information gain, and intrinsic motivation. Furthermore, Friston et al. (2017) have provided evidence supporting active inference as a promising computational framework, grounded in contemporary neuroscience, that can produce human-like perceptual-motor learning.</p>
<h3 id="refrences">References</h3>
<ul>
<li><a href="https://gershmanlab.com/pubs/free_energy.pdf">What does the free energy principle tell us about the brain?</a></li>
<li><a href="https://www.uab.edu/medicine/cinl/images/KFriston_FreeEnergy_BrainTheory.pdf">The free-energy principle: a unified brain theory </a></li>
<li><a href="https://www.fil.ion.ucl.ac.uk/~karl/The%20free-energy%20principle%20-%20a%20rough%20guide%20to%20the%20brain.pdf">The free-energy principle: a rough guide to the brain?</a></li>
<li><a href="https://www.fil.ion.ucl.ac.uk/~karl/A%20free%20energy%20principle%20for%20the%20brain.pdf">A free energy principle for the brain</a></li>
<li><a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5167251/">Active inference and learning</a></li>
<li><a href="https://link.springer.com/article/10.1007/s00422-019-00805-w">Generalised free energy and active inference</a></li>
<li><a href="https://direct.mit.edu/books/oa-monograph/5299/Active-InferenceThe-Free-Energy-Principle-in-Mind">Active Inference: The Free Energy Principle in Mind, Brain, and Behavior</a></li>
<li><a href="https://link.springer.com/article/10.1007/s11229-023-04292-2">Incorporating (variational) free energy models into mechanisms</a></li>
<li><a href="https://www.frontiersin.org/articles/10.3389/fnbot.2022.844773">The Problem of Meaning: The Free Energy Principle and Artificial Agency</a></li>
<li><a href="https://www.frontiersin.org/articles/10.3389/fpsyg.2019.00592">“Surprise” and the Bayesian Brain: Implications for Psychotherapy Theory and Practice</a></li>
<li><a href="https://link.springer.com/article/10.1007/s10539-022-09864-z">Free energy: a user’s guide</a></li>
<li><a href="https://link.springer.com/article/10.1007/s13164-021-00579-w">Active Inference as a Computational Framework for Consciousness</a></li>
<li><a href="https://www.frontiersin.org/articles/10.3389/fnsys.2021.772641/full">Understanding, Explanation, and Active Inference</a></li>
<li><a href="https://www.frontiersin.org/articles/10.3389/fncom.2023.1099593/full">A neural active inference model of perceptual-motor learning </a></li>
</ul>
Wed, 28 Feb 2024 00:00:00 +0000
/posts/2024-02-28/FEP.html
Free Energy PrinciplePredictive ProcessingActive InferencepostsBayesian Gaussian Mixture Models (GMM)<h3 id="bayesian-gaussian-mixture-models">Bayesian Gaussian Mixture Models</h3>
<p>In statistics, a mixture model is a probabilistic model used to represent the presence of subpopulations within an overall population without requiring that an individual belongs to a specific subpopulation. It is a flexible approach that can be used to model complex data containing multiple regions of high probability mass, such as multimodal distributions. A typical finite-dimensional mixture model consists of observed random variables, latent random variables specifying the identity of the mixture component of each observation, mixture weights, and component parameters. Mixture models can be used to make statistical inferences about the properties of subpopulations without requiring information about which subpopulation each observation belongs to. Mixture models are also referred to as latent class models if they assume that some of their parameters differ across unobserved subgroups or classes.</p>
<p>Bayesian mixture models can be implemented in Stan, a probabilistic programming language. Mixture models assume that a given measurement can be drawn from one of K data generating processes, each with its own set of parameters. Stan allows for the fitting of Bayesian mixture models using its Hamiltonian Monte Carlo sampler. The models can be parameterized in several ways (see below) and used directly for modeling data with multimodal distributions or as priors for other parameters. The implementation of mixture models in Stan involves defining the model, specifying the priors, and marginalizing out the discrete parameters. Several resources provide examples and tutorials on fitting Bayesian mixture models in Stan, demonstrating the practical implementation of these models.</p>
<p>In this post I will first introduce how mixture models are implemented in Bayesian inference. It is noteworthy that non-identifiability is inherent in these models, and that it can be tempered with principled prior information. Michael Betancourt has a blog post describing the problems often encountered with Gaussian mixture models, specifically the estimation of the parameters of a mixture model and identifiability, i.e. the problem of labelling <a href="http://mc-stan.org/documentation/case-studies/identifying_mixture_models.html">mixtures</a>.</p>
<h4 id="single-varaible-example">Single variable example</h4>
<h5 id="data-simulation">Data Simulation</h5>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>library(dplyr)
library(ggplot2)
library(ggthemes)
N <- 500
# three clusters
mu <- c(1, 4, 9)
sigma <- c(1.2, 1, 0.8)
# cluster weights (note: sample() normalizes these to sum to 1)
Theta <- c(.3, .5, .3)
# Draw which model each belongs to
z <- sample(1:3, size = N, prob = Theta, replace = T)
# white noise
epsilon <- rnorm(N)
# Simulate the data using the fact that y ~ normal(mu, sigma) can be
# expressed as y = mu + sigma*epsilon for epsilon ~ normal(0, 1)
y <- mu[z] + sigma[z]*epsilon
tibble(y = y, z = as.factor(z)) %>%
ggplot(aes(x = y, fill = z)) +
geom_density(alpha = 0.3) +
ggtitle("Three clusters")
</code></pre></div></div>
<p><img src="/images/gmm_1.png" alt="" /></p>
<h5 id="stan-model-code-and-description">Stan model: code and description</h5>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>mixture_model<-'
// saved as finite_mixture_linear_regression.stan
data {
int N;
vector[N] y;
int n_groups;
}
parameters {
vector[n_groups] mu;
vector<lower = 0>[n_groups] sigma;
simplex[n_groups] Theta;
}
model {
vector[n_groups] contributions;
// priors
mu ~ normal(0, 10);
sigma ~ cauchy(0, 2);
Theta ~ dirichlet(rep_vector(2.0, n_groups));
// likelihood
for(i in 1:N) {
for(k in 1:n_groups) {
contributions[k] = log(Theta[k]) + normal_lpdf(y[i] | mu[k], sigma[k]);
}
target += log_sum_exp(contributions);
}
}'
</code></pre></div></div>
<p><strong>Data Block</strong></p>
<ul>
<li><code class="language-plaintext highlighter-rouge">N</code>: Number of observations.</li>
<li><code class="language-plaintext highlighter-rouge">y</code>: Vector of observed responses.</li>
<li><code class="language-plaintext highlighter-rouge">n_groups</code>: Number of mixture components or groups.</li>
</ul>
<p><strong>Parameters Block</strong></p>
<ul>
<li><code class="language-plaintext highlighter-rouge">mu</code>: Vector of means for each mixture component.</li>
<li><code class="language-plaintext highlighter-rouge">sigma</code>: Vector of standard deviations for each mixture component.</li>
<li><code class="language-plaintext highlighter-rouge">Theta</code>: Vector of mixing proportions, representing the probability of each group.</li>
</ul>
<p><strong>Model Block</strong></p>
<ul>
<li><strong>Priors</strong>: Normal priors are specified for the means <code class="language-plaintext highlighter-rouge">mu</code> with a mean of 0 and a standard deviation of 10. Cauchy priors are specified for the standard deviations <code class="language-plaintext highlighter-rouge">sigma</code> with a location of 0 and a scale of 2. Dirichlet priors are specified for the mixing proportions <code class="language-plaintext highlighter-rouge">Theta</code> with equal concentration parameters of 2.0 for each group.</li>
<li><strong>Likelihood</strong>: The likelihood is constructed within a nested loop. For each observation <code class="language-plaintext highlighter-rouge">i</code> and each group <code class="language-plaintext highlighter-rouge">k</code>, it calculates the log-likelihood of the observation given the mean and standard deviation of that group. These log-likelihoods are stored in the <code class="language-plaintext highlighter-rouge">contributions</code> vector.</li>
<li><strong>Log-Sum-Exp Trick</strong>: To avoid numerical instability when dealing with small probabilities, the log-sum-exp trick is used. The <code class="language-plaintext highlighter-rouge">log_sum_exp</code> function sums up the contributions after exponentiating them. This is done to compute the log-likelihood of the data given the mixture model.</li>
<li><strong>Target</strong>: The <code class="language-plaintext highlighter-rouge">target</code> is incremented by the log of the sum of exponentiated contributions for each observation. The <code class="language-plaintext highlighter-rouge">target</code> is essentially the log-posterior, and the goal of Stan is to maximize it during sampling.</li>
</ul>
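<p>As an aside, the log-sum-exp computation can be sketched outside Stan in a few lines of Python (an illustrative sketch, not part of the original model; the numbers are made up):</p>

```python
import math

def log_sum_exp(xs):
    """Numerically stable log(sum(exp(x))): shift by the maximum so the
    largest exponentiated term is exp(0) = 1 and nothing overflows."""
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

# Per-component contributions on the log scale, as in the Stan loop:
# log(Theta[k]) + normal_lpdf(y | mu[k], sigma[k]) can easily fall below
# -700, where exp() underflows to zero in double precision.
contributions = [-746.0, -745.0, -744.0]
log_mix_density = log_sum_exp(contributions)  # finite, about -743.59
```

<p>Exponentiating these contributions directly would underflow to (nearly) zero, making a naive log of the sum meaningless; the shifted version returns a finite, accurate value, which is exactly why the likelihood above accumulates contributions with <code class="language-plaintext highlighter-rouge">log_sum_exp</code> rather than summing raw densities.</p>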
<h5 id="fitting-and-output">Fitting and output</h5>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>library(rstan)
options(mc.cores = parallel::detectCores())
fit=stan(model_code=mixture_model, data=list(N= N, y = y, n_groups = 3), iter=3000, warmup=500, chains=3)
print(fit)
params=extract(fit)
#density plots of the posteriors of the mixture means
par(mfrow=c(1,3))
plot(density(params$mu[,1]), ylab='', xlab='mu[1]', main='')
abline(v=c(8), lty='dotted', col='red',lwd=2)
plot(density(params$mu[,2]), ylab='', xlab='mu[2]', main='')
abline(v=c(0), lty='dotted', col='red',lwd=2)
plot(density(params$mu[,3]), ylab='', xlab='mu[3]', main='')
abline(v=c(4), lty='dotted', col='red',lwd=2)
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Inference for Stan model: 9c40393d28e90e2c335fff95de690860.
3 chains, each with iter=3000; warmup=500; thin=1;
post-warmup draws per chain=2500, total post-warmup draws=7500.
mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat
mu[1] 6.35 3.06 3.76 0.65 1.21 8.96 9.02 9.11 2 21.87
mu[2] 3.72 3.05 3.74 0.59 0.94 1.25 8.96 9.09 2 13.86
mu[3] 4.03 0.00 0.17 3.71 3.92 4.03 4.14 4.36 2575 1.00
sigma[1] 0.85 0.17 0.23 0.62 0.69 0.73 1.02 1.41 2 2.32
sigma[2] 1.01 0.19 0.27 0.64 0.73 1.03 1.19 1.60 2 1.76
sigma[3] 1.13 0.00 0.12 0.92 1.05 1.12 1.20 1.39 3232 1.00
Theta[1] 0.27 0.00 0.04 0.19 0.25 0.27 0.29 0.34 1186 1.02
Theta[2] 0.27 0.00 0.05 0.18 0.24 0.26 0.29 0.40 1553 1.00
Theta[3] 0.47 0.00 0.06 0.32 0.44 0.47 0.51 0.57 1702 1.00
lp__ -1161.00 0.05 2.18 -1166.34 -1162.15 -1160.64 -1159.40 -1157.91 2064 1.00
Samples were drawn using NUTS(diag_e) at Tue Feb 6 13:03:03 2024.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at
convergence, Rhat=1).
</code></pre></div></div>
<p><img src="/images/gmm_2.png" alt="" /></p>
<h4 id="example-with-multiple-variable">Example with multiple variables</h4>
<h5 id="data-simulation-1">Data Simulation</h5>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>library(MASS)
library(tidyr) # needed below for pivot_longer()
#first cluster
mu1=c(0,0,0,0)
sigma1=matrix(c(0.2,0,0,0,0,0.2,0,0,0,0,0.1,0,0,0,0,0.1),ncol=4,nrow=4, byrow=TRUE)
norm1=mvrnorm(30, mu1, sigma1)
#second cluster
mu2=c(10,10,10,10)
sigma2=sigma1
norm2=mvrnorm(30, mu2, sigma2)
#third cluster
mu3=c(4,4,4,4)
sigma3=sigma1
norm3=mvrnorm(30, mu3, sigma3)
norms=rbind(norm1,norm2,norm3) #combine the 3 mixtures together
N=90 #total number of data points
Dim=4 #number of dimensions
y=array(as.vector(norms), dim=c(N,Dim))
mixture_data=list(N=N, D=4, K=3, y=y)
as.data.frame(norms) %>%
pivot_longer(colnames(as.data.frame(norms)), names_to = "var", values_to = "value")%>%
ggplot( aes(x=value, color=var)) + geom_density() +
ggtitle("Three clusters on four variables")
</code></pre></div></div>
<p><img src="/images/gmm_3.png" alt="" /></p>
<h5 id="stan-model-code-and-description-1">Stan model: code and description</h5>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>mixture_model<-'
data {
int D; //number of dimensions
int K; //number of gaussians
int N; //number of data
vector[D] y[N]; //data
}
parameters {
simplex[K] theta; //mixing proportions
ordered[D] mu[K]; //mixture component means
cholesky_factor_corr[D] L[K]; //cholesky factor of covariance
}
model {
real ps[K];
for(k in 1:K){
mu[k] ~ normal(0,3);
L[k] ~ lkj_corr_cholesky(4);
}
for (n in 1:N){
for (k in 1:K){
ps[k] = log(theta[k])+multi_normal_cholesky_lpdf(y[n] | mu[k], L[k]);
}
target += log_sum_exp(ps);
}
}'
</code></pre></div></div>
<p><strong>Data Block</strong></p>
<ul>
<li><code class="language-plaintext highlighter-rouge">D</code>: Number of dimensions.</li>
<li><code class="language-plaintext highlighter-rouge">K</code>: Number of Gaussian components.</li>
<li><code class="language-plaintext highlighter-rouge">N</code>: Number of data points.</li>
<li><code class="language-plaintext highlighter-rouge">y</code>: An array of vectors, each representing a data point in D dimensions.</li>
</ul>
<p><strong>Parameters Block</strong></p>
<ul>
<li><code class="language-plaintext highlighter-rouge">theta</code>: Mixing proportions. It is a simplex, ensuring that the proportions sum to 1.</li>
<li><code class="language-plaintext highlighter-rouge">mu</code>: Mixture component means. These are ordered variables.</li>
<li><code class="language-plaintext highlighter-rouge">L</code>: Cholesky factors of the covariance matrices for each component.</li>
</ul>
<p><strong>Model Block</strong></p>
<ul>
<li>
<p><strong>Priors</strong>: Priors are specified for the means <code class="language-plaintext highlighter-rouge">mu</code> and the Cholesky factors <code class="language-plaintext highlighter-rouge">L</code>. Each mean is drawn from a normal distribution with a mean of 0 and a standard deviation of 3. Each Cholesky factor is drawn from an LKJ correlation distribution with shape parameter 4.</p>
</li>
<li>
<p><strong>Log-Probability Calculation</strong>: For each data point <code class="language-plaintext highlighter-rouge">n</code> and each component <code class="language-plaintext highlighter-rouge">k</code>, the log-probability <code class="language-plaintext highlighter-rouge">ps[k]</code> is calculated. This log-probability is the logarithm of the product of the mixing proportion and the multivariate normal density of the data point under the k-th component.</p>
</li>
<li>
<p><strong>Target Increment</strong>: The <code class="language-plaintext highlighter-rouge">target</code> is incremented by the logarithm of the sum of exponentiated log-probabilities <code class="language-plaintext highlighter-rouge">ps</code>. This step ensures that the model assigns higher probability to data points that are well-explained by one of the Gaussian components.</p>
</li>
</ul>
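<p>To make the role of the Cholesky factor concrete: <code class="language-plaintext highlighter-rouge">multi_normal_cholesky</code> interprets its <code class="language-plaintext highlighter-rouge">L</code> argument as the Cholesky factor of the covariance matrix, i.e. Sigma = L·Lᵀ. A small pure-Python sketch (illustrative only; the 2×2 factor below is hypothetical):</p>

```python
def cholesky_to_cov(L):
    """Reconstruct Sigma = L * L^T from a lower-triangular Cholesky
    factor, which is how multi_normal_cholesky interprets L."""
    d = len(L)
    return [[sum(L[i][k] * L[j][k] for k in range(d)) for j in range(d)]
            for i in range(d)]

# A 2x2 correlation Cholesky factor with correlation rho = 0.6
rho = 0.6
L = [[1.0, 0.0],
     [rho, (1.0 - rho**2) ** 0.5]]
Sigma = cholesky_to_cov(L)  # [[1.0, 0.6], [0.6, 1.0]] up to rounding
```

<p>Note that because the model passes a correlation Cholesky factor (unit-length rows) directly to <code class="language-plaintext highlighter-rouge">multi_normal_cholesky</code>, the implied component covariances have unit scales; a separate scale vector would be needed to model non-unit variances.</p>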
<h5 id="fitting-and-output-1">Fitting and output</h5>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>fit=stan(model_code=mixture_model, data=mixture_data, iter=3000, warmup=1000, chains=1)
print(fit)
Inference for Stan model: f913dae683b9f29657b0863fec348d71.
1 chains, each with iter=3000; warmup=1000; thin=1;
post-warmup draws per chain=2000, total post-warmup draws=2000.
mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat
theta[1] 0.33 0.00 0.05 0.24 0.30 0.33 0.37 0.43 2120 1.00
theta[2] 0.33 0.00 0.05 0.24 0.30 0.33 0.36 0.44 1740 1.00
theta[3] 0.33 0.00 0.05 0.24 0.30 0.33 0.37 0.43 1854 1.00
mu[1,1] 3.80 0.00 0.11 3.55 3.74 3.80 3.86 3.99 632 1.01
mu[1,2] 3.91 0.01 0.12 3.67 3.84 3.90 3.98 4.15 236 1.00
mu[1,3] 4.02 0.01 0.13 3.78 3.92 4.02 4.11 4.29 344 1.00
mu[1,4] 4.09 0.01 0.15 3.82 3.99 4.09 4.20 4.39 259 1.00
mu[2,1] 9.77 0.01 0.12 9.51 9.70 9.79 9.86 9.98 405 1.00
mu[2,2] 9.96 0.00 0.11 9.74 9.90 9.96 10.03 10.20 794 1.00
mu[2,3] 10.09 0.00 0.11 9.91 10.01 10.07 10.15 10.33 518 1.00
mu[2,4] 10.18 0.01 0.12 9.97 10.09 10.17 10.25 10.43 498 1.00
mu[3,1] -0.22 0.01 0.13 -0.49 -0.30 -0.21 -0.14 0.01 409 1.01
mu[3,2] -0.10 0.01 0.13 -0.37 -0.17 -0.09 -0.02 0.17 218 1.00
mu[3,3] 0.07 0.01 0.13 -0.17 0.00 0.07 0.15 0.36 81 1.00
mu[3,4] 0.16 0.01 0.12 -0.03 0.08 0.15 0.23 0.42 541 1.00
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at
convergence, Rhat=1).
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>
params=extract(fit)
#density plots of the posteriors of the mixture means
par(mfrow=c(1,3))
plot(density(params$mu[,1,1]), ylab='', xlab='mu[1]', main='')
lines(density(params$mu[,1,2]), col=rgb(0,0,0,0.7))
lines(density(params$mu[,1,3]), col=rgb(0,0,0,0.4))
lines(density(params$mu[,1,4]), col=rgb(0,0,0,0.1))
abline(v=c(4), lty='dotted', col='red',lwd=2)
plot(density(params$mu[,2,1]), ylab='', xlab='mu[2]', main='')
lines(density(params$mu[,2,2]), col=rgb(0,0,0,0.7))
lines(density(params$mu[,2,3]), col=rgb(0,0,0,0.4))
lines(density(params$mu[,2,4]), col=rgb(0,0,0,0.1))
abline(v=c(10), lty='dotted', col='red',lwd=2)
plot(density(params$mu[,3,1]), ylab='', xlab='mu[3]', main='')
lines(density(params$mu[,3,2]), col=rgb(0,0,0,0.7))
lines(density(params$mu[,3,3]), col=rgb(0,0,0,0.4))
lines(density(params$mu[,3,4]), col=rgb(0,0,0,0.1))
abline(v=c(0), lty='dotted', col='red',lwd=2)
</code></pre></div></div>
<p><img src="/images/gmm_4.png" alt="" /></p>
<h3 id="advantages-and-limitations">Advantages and Limitations</h3>
<p>Bayesian mixture models offer several advantages in statistical modeling. Their inherent flexibility makes them well-suited for diverse tasks such as clustering, data compression, outlier detection, and generative classification. The Bayesian framework’s ability to incorporate prior knowledge enhances model accuracy, especially when informative prior information is available. Moreover, these models effectively handle unobserved heterogeneity by integrating multiple data generating processes, proving valuable when data alone may not fully identify underlying patterns. The stability provided by Bayesian estimation ensures reliable posterior distributions, reducing sensitivity to issues like singularities, over-fitting, and violated identification criteria. Bayesian mixture models also facilitate the examination of the posterior distribution of the number of classes, offering insights into the underlying class structure of the data.</p>
<p>However, the use of Bayesian mixture models comes with certain limitations. Applying these models demands a high level of statistical expertise to appropriately specify priors and ensure correct model formulation, presenting a challenge for practitioners lacking a strong background in Bayesian statistics. The complexity of posterior inference is compounded by label switching, a phenomenon that complicates the interpretation of results. Bayesian nonparametric mixture models, in particular, may suffer from inconsistency in estimating the number of clusters, impacting their performance in clustering applications. Additionally, model fitting challenges arise, and careful evaluation of inaccuracies in predictions and comparison with alternative models are essential to address potential shortcomings.</p>
<p>In this post, we learned to fit mixture models using Stan. We saw how to evaluate model fit using the usual prior and posterior predictive checks, and to investigate parameter recovery. Such mixture models are notoriously difficult to fit, but they have a lot of potential in cognitive science applications, especially in developing computational models of different kinds of cognitive processes. The reader interested in a deeper understanding of the potential challenges in the process can refer to Betancourt’s discussion of identification problems in Bayesian mixture models in a <a href="https://mc-stan.org/users/documentation/case-studies/identifying_mixture_models.html">case study</a>.</p>
<h3 id="references">References</h3>
<ul>
<li><a href="https://modernstatisticalworkflow.blogspot.com/2016/10/finite-mixture-models-in-stan.html">Finite mixture models in Stan</a></li>
<li><a href="https://maggielieu.com/2017/03/21/multivariate-gaussian-mixture-model-done-properly/">Multivariate Gaussian Mixture Model done properly </a></li>
<li><a href="https://mc-stan.org/docs/stan-users-guide/mixture-modeling.html">Finite Mixtures</a></li>
<li><a href="https://mc-stan.org/users/documentation/case-studies/identifying_mixture_models.html">Identifying Bayesian Mixture Models</a></li>
<li><a href="https://vasishth.github.io/bayescogsci/book/ch-mixture.html">Mixture models</a></li>
<li><a href="https://rpubs.com/kaz_yos/fmm2">Bayesian Density Estimation (Finite Mixture Model) </a></li>
<li><a href="https://hal.science/hal-03866434/document">Bayesian mixture models (in)consistency for the number of clusters</a></li>
<li><a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6459682/">Advantages of a Bayesian Approach for Examining Class Structure in Finite Mixture Models</a></li>
</ul>
Sun, 28 Jan 2024 00:00:00 +0000
/posts/2024-01-28/Bayes_GMM.html
STANRBayespostsUncertainty and Bias in AI Models<h3 id="uncertainty-and-bias-in-ai-models">Uncertainty and Bias in AI Models</h3>
<p>As Artificial Intelligence (AI) has rapidly become central to many aspects of modern life and its models have grown more sophisticated, the challenges of uncertainty and bias have emerged as important factors that require thorough examination. Understanding the different types of uncertainty and bias in AI models is essential for building responsible, equitable, and reliable AI systems that benefit society at large. Here we discuss these types, highlighting their implications and potential strategies to address them.</p>
<p><strong>Uncertainty</strong> in AI models arises from various sources. Aleatoric uncertainty refers to inherent randomness or variability in the data, which can occur due to measurement errors or natural variability in the environment. Epistemic uncertainty, on the other hand, stems from a lack of knowledge or information, such as when a model encounters data that fall outside its training distribution. Epistemic uncertainty is particularly pertinent in scenarios where AI models face novel situations, as they may struggle to provide accurate predictions without sufficient training data. Finally, prediction uncertainty refers to the uncertainty associated with the model’s predictions; it arises from the complexity of the underlying problem, limited training data, or inherent noise in the data. Uncertainty in AI models can lead to less systematic, accurate, or relevant predictions, particularly for groups with less accurate data in the training sample.</p>
<p>Therefore, uncertainty in AI models carries significant implications: it can yield inconsistent or unreliable predictions, and it influences decision-making, as decision-makers must weigh uncertainty levels when relying on AI predictions. Along these lines, effective mitigation of uncertainty involves several strategies. Firstly, quantifying uncertainty through techniques like Bayesian inference and Monte Carlo dropout provides insight into model reliability; Bayesian methods, which provide a principled framework for quantifying uncertainty, enable the integration of prior knowledge and the propagation of uncertainty throughout the model. Secondly, maintaining the quality and representativeness of training data enhances the model’s ability to handle uncertainty. Thirdly, regular model evaluation and validation are crucial for identifying and addressing uncertainty, including assessing performance on various subgroups and monitoring for unintended biases or unfair outcomes. Finally, developing more interpretable models enhances the transparency of decision-making processes, providing insight into the sources of uncertainty. Epistemic, aleatoric, and prediction uncertainty present complex challenges that require interdisciplinary collaboration between AI researchers, domain experts, and ethicists.</p>
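<p>As a toy illustration of quantifying prediction uncertainty by repeated stochastic forward passes (the idea behind Monte Carlo dropout), here is a minimal Python sketch; <code class="language-plaintext highlighter-rouge">noisy_model</code> is a hypothetical stand-in for a network with dropout left active at inference time, and all numbers are invented:</p>

```python
import random
import statistics

def mc_uncertainty(predict, x, n_samples=500, seed=0):
    """Run a stochastic predictor many times on the same input and
    summarise the spread of its outputs as a crude uncertainty estimate."""
    rng = random.Random(seed)
    draws = [predict(x, rng) for _ in range(n_samples)]
    return statistics.mean(draws), statistics.stdev(draws)

def noisy_model(x, rng):
    # Hypothetical stochastic predictor: a fixed signal plus noise that
    # stands in for the randomness dropout injects at inference time.
    return 2.0 * x + rng.gauss(0.0, 0.5)

mean_pred, spread = mc_uncertainty(noisy_model, 3.0)
# mean_pred is close to 6.0; spread estimates the predictive noise (~0.5)
```

<p>A large <code class="language-plaintext highlighter-rouge">spread</code> flags inputs on which the model is unsure, which a decision-maker can then treat with appropriate caution.</p>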
<p><strong>Bias</strong> in AI models is another major concern, reflecting the potential for algorithms to perpetuate and amplify societal biases present in training data. The historical underrepresentation of certain groups in datasets can result in biased predictions or recommendations, reinforcing societal inequalities. Selection bias occurs when the data used for training do not accurately represent the broader population, leading to models that perform well on specific subgroups but fail on others. This is also discussed as data bias, referring to biases in the training data that can arise from various sources, such as sampling bias, label bias, or underrepresentation of certain groups; biased training data can lead to biased model predictions and unfair outcomes. Furthermore, the non-interpretability of some AI algorithms exacerbates the challenge of identifying and correcting bias: when the specific causes behind biased predictions cannot be traced, transparency, accountability, and the ability to mitigate bias effectively all suffer. The ethical and societal implications include discriminatory outcomes in domains such as criminal justice or lending, perpetuating unjust disparities. Perceived bias or unfairness also erodes trust in AI systems: users may hesitate to trust or adopt these technologies, impeding their potential benefits.</p>
<p>Effectively mitigating bias in AI models involves several strategies. Bias detection and evaluation methods, including fairness metrics, bias audits, and interpretability tools, are vital for identifying and quantifying biases. Ensuring diversity and representativeness in training data, through careful collection, preprocessing, and bias mitigation measures, enhances the model’s ability to make fair predictions. Moreover, regular model evaluation and validation, including subgroup performance assessment and monitoring for unintended biases or unfair outcomes, are crucial to addressing and rectifying biases. Collaboration between diverse stakeholders, including ethicists, domain experts, and communities affected by AI, fosters a collective approach to identifying and mitigating biases. Algorithmic bias, data bias, and representation bias collectively underscore the pressing need to address biases at multiple levels, from the data sources to the algorithms themselves. Fairness-aware machine learning algorithms aim to rectify biases in model outputs by explicitly considering fairness metrics during training; by employing such algorithms, scrutinizing training data, and promoting representation diversity, the AI community can build models that more accurately reflect the complexities of the real world while upholding ethical standards. As AI continues to shape society, the proactive mitigation of bias helps ensure that these transformative technologies contribute positively to a fair and just future for all.</p>
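<p>As one concrete example of a fairness metric, the demographic parity gap, the difference in positive-prediction rates between groups, can be computed in a few lines of Python; the predictions and group labels below are made up for illustration:</p>

```python
def demographic_parity_gap(preds, groups):
    """Largest difference in positive-prediction rates across groups:
    0.0 means every group receives positive predictions at the same rate."""
    rates = {}
    for g in set(groups):
        members = [p for p, gg in zip(preds, groups) if gg == g]
        rates[g] = sum(members) / len(members)
    return max(rates.values()) - min(rates.values())

# Hypothetical binary predictions (1 = positive outcome) for two groups
preds  = [1, 1, 0, 1, 0, 0, 0, 1]
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]
gap = demographic_parity_gap(preds, groups)  # 0.75 - 0.25 = 0.5
```

<p>Demographic parity is only one of several competing fairness criteria (equalized odds and calibration are others), and which one is appropriate depends on the application.</p>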
<p>AI researchers and practitioners are increasingly recognizing the importance of responsible AI development. Efforts to address uncertainty and bias in AI models include the development of fairness-aware algorithms that explicitly consider fairness metrics during training, as well as transparent AI models, which provide explanations for their predictions, offer insights into their decision-making process, facilitating the identification of bias and the establishment of trust. This can enable stakeholders to assess the fairness and reliability of the models and make informed decisions. Along the same lines, careful data collection and preprocessing techniques, along with addressing data quality limitations paired with regular auditing and testing of AI models on various subgroups can help identify and rectify bias, ensuring more equitable outcomes and reduce improve model performance.</p>
<p>As AI continues to play an increasingly central role in our lives, it is imperative to navigate these challenges responsibly, striving for AI systems that serve the broader good while upholding ethical and equitable principles. By implementing strategies such as transparent model design, diverse training data, and regular evaluation, we can mitigate uncertainties and biases in AI models and ensure their responsible and equitable use. The challenges of uncertainty and bias in AI models necessitate a comprehensive understanding of their various forms and impacts. Addressing these challenges is central for developing AI systems that are robust, equitable, and accountable. Tackling uncertainty through probabilistic methods and mitigating bias via data preprocessing and fairness-aware algorithms are essential steps in creating AI models that benefit society without exacerbating existing societal disparities.</p>
<h3 id="implications-for-alignment-and-responsible-ai">Implications for Alignment and Responsible AI</h3>
<p>The alignment problem, the pursuit of AI models’ actions aligning with human values and intentions, is profoundly affected by uncertainty and bias. Both can lead to unintended consequences and undesirable outcomes: biased predictions or uncertain outputs can result in actions that deviate from human intentions, potentially causing harm or violating ethical principles. More specifically, epistemic uncertainty can obscure models’ comprehension of their goals and introduce unforeseen consequences. Aleatoric uncertainty undermines the consistency of model behavior, rendering it challenging to ensure adherence to desired outcomes. Prediction uncertainty introduces doubt into human-AI collaborations, hampering trust and effective decision-making. Mitigating uncertainty’s adverse effects on alignment necessitates advanced interpretability and explainability techniques, enabling humans to comprehend the rationale behind AI decisions. Moreover, uncertainty-aware AI architectures that acknowledge and reason about uncertainty levels can help align AI behaviors with human values. Along the same lines, bias-aware AI systems, cognizant of bias types and their implications, pave the way for safer and more reliable decision-making. Bias detection mechanisms, transparency tools, and fairness-aware algorithms are essential to rooting out biases in AI models. Responsible AI requires mechanisms for accountability and transparency, enabling users to understand and challenge the decisions made by AI models. Understanding and addressing these challenges is crucial for developing AI systems that align with human values, are fair and equitable, and inspire trust.</p>
<h3 id="references">References:</h3>
<ul>
<li>O’Neil, C. (2016). Weapons of Math Destruction: How Big Data Increases Inequality and Threatens Democracy. Broadway Books.</li>
<li>Barocas, S., Hardt, M., & Narayanan, A. (2019). Fairness and Machine Learning. http://fairmlbook.org</li>
<li><a href="https://biglinden.com/uncertainty-and-bias-in-ai-and-machine-learning/">Understanding Bias & Uncertainty for AI & ML</a></li>
<li><a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8830968/">AI bias: exploring discriminatory algorithmic decision-making models and the application of possible machine-centric solutions adapted from the pharmaceutical industry</a></li>
<li><a href="https://aithority.com/machine-learning/the-uncertainty-bias-in-ai-and-how-to-tackle-it/">The Uncertainty Bias in AI and How to Tackle it</a></li>
<li><a href="https://machinelearningmastery.com/uncertainty-in-machine-learning/">A Gentle Introduction to Uncertainty in Machine Learning</a></li>
<li><a href="https://towardsdatascience.com/uncertainty-in-deep-learning-brief-introduction-1f9a5de3ae04">Uncertainty in Deep Learning — Brief Introduction</a></li>
<li><a href="https://www.kdnuggets.com/2022/04/uncertainty-quantification-artificial-intelligencebased-systems.html">Uncertainty Quantification in Artificial Intelligence-based Systems</a></li>
<li><a href="https://www.infoworld.com/article/3607748/3-kinds-of-bias-in-ai-models-and-how-we-can-address-them.html">“3 kinds of bias in AI models — and how we can address them”</a></li>
</ul>
Fri, 22 Dec 2023 00:00:00 +0000
/posts/2023-12-22/Uncertainty_bias.html
Artificial intelligenceUncertaintyBiaspostsResponsible AI and Supervised fine-tuning (SFT)<h3 id="supervised-fine-tuning-sft">Supervised fine-tuning (SFT)</h3>
<p>Supervised fine-tuning (SFT) is a technique used in machine learning, specifically in the field of transfer learning. It involves taking a pre-trained model, typically a deep neural network that has been trained on a large dataset for a related task, and further training it on a smaller labeled dataset specific to the task at hand. The idea is to leverage the knowledge and representations learned by a model on a source task and apply them to a target task. The pre-trained model serves as a starting point, providing a good initialization for the target task. However, since the pre-trained model was trained on a different task or dataset, it may not directly fit the target task’s data; this is where supervised fine-tuning comes into play. The pre-trained model’s parameters are further adjusted, or fine-tuned, using the labeled data from the target task: during fine-tuning, the model’s weights are updated using backpropagation and gradient descent, optimizing performance specifically for the target task. The model thereby becomes more tailored to the specific characteristics and nuances of the target task’s data.</p>
<p>Supervised fine-tuning is particularly useful when the target task has a smaller labeled dataset than the original pre-training dataset. Instead of training a model from scratch on the target task, which may require a larger amount of labeled data, SFT allows for efficient re-use of the knowledge already captured by the pre-trained model. This can significantly reduce the training time and resource requirements for the target task while still achieving good performance.</p>
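<p>The mechanics can be caricatured with a tiny Python sketch: treat a one-dimensional feature produced by a frozen, pre-trained “backbone” as fixed, and fit only a small linear head on the target task’s labeled data by gradient descent. All names and numbers here are invented for illustration:</p>

```python
def fine_tune_head(features, labels, lr=0.1, epochs=300):
    """Fit a linear head y = w*x + b on frozen, pre-extracted features
    by gradient descent on mean squared error -- a toy stand-in for
    adapting a task-specific layer of a pre-trained model."""
    w, b = 0.0, 0.0  # head starts from scratch; the backbone stays fixed
    n = len(features)
    for _ in range(epochs):
        grad_w = sum(2 * ((w * x + b) - y) * x for x, y in zip(features, labels)) / n
        grad_b = sum(2 * ((w * x + b) - y) for x, y in zip(features, labels)) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# A small labeled target-task dataset whose labels follow y = 2x + 1
feats  = [0.0, 1.0, 2.0, 3.0]
labels = [1.0, 3.0, 5.0, 7.0]
w, b = fine_tune_head(feats, labels)  # converges to w near 2, b near 1
```

<p>In a real SFT setup the head would be the final layer(s) of a deep network, the loss would typically be cross-entropy, and fine-tuning may update all weights rather than only the head, but the structure is the same: reuse frozen representations and adapt a comparatively small number of parameters on the smaller labeled dataset.</p>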
<h3 id="alignment-and-responsible-ai-through-sft">Alignment and Responsible AI through SFT</h3>
<p>SFT offers one of multiple ways towards alignment between human values and the behavior of AI systems. As AI continues to advance and play an increasingly prominent role in our lives, ensuring alignment becomes a critical objective. SFT presents a promising approach to bridge the gap between pre-trained models and specific tasks, allowing for customization and alignment with human preferences. Here we discuss SFT, its underlying principles, and its potential as a technique toward achieving alignment. At its core, SFT involves taking a pre-trained model, typically trained on a large dataset, and fine-tuning it on a more specific task using labeled data. The pre-training phase equips the model with a general understanding of various concepts, while the fine-tuning phase refines its performance for the task at hand. This two-step process enables the model to leverage existing knowledge and adapt it to the specific requirements and nuances of a particular task, facilitating alignment with human values and objectives. One of the key advantages of SFT is its ability to incorporate human supervision during the fine-tuning process. By providing labeled data and guidance, human experts can shape the behavior of the AI system to align with desired outcomes. This human feedback serves as a crucial mechanism for correcting biases, refining decision-making, and incorporating ethical considerations. By utilizing labeled data, the fine-tuning process becomes more transparent, as the model’s behavior can be linked to specific examples and human annotations. This transparency enables stakeholders to understand and assess the decision-making process of the AI system.</p>
<p>While SFT presents a promising technique toward achieving alignment, it is important to acknowledge its challenges and limitations. The quality and representativeness of the labeled data used for fine-tuning significantly impact the alignment achieved. Biases or inaccuracies in the labeled data can propagate into the fine-tuned model, potentially leading to misalignment. To overcome these challenges, ongoing research focuses on developing techniques that improve the quality and diversity of labeled data. Active learning approaches aim to intelligently select data points for labeling, maximizing the information gained while minimizing the need for extensive labeling efforts. Adversarial fine-tuning techniques seek to identify and mitigate biases introduced during the fine-tuning process, promoting fairness and alignment. These advancements contribute to the ongoing refinement of SFT and its potential to achieve greater alignment between AI systems and human values. SFT is related to responsible AI in the context of adapting pre-trained models to specific tasks while considering ethical and responsible considerations. For instance, when applying SFT, one has to ensure that the pre-trained model used for fine-tuning is itself fair and unbiased, and avoid reinforcing or amplifying any biases present in the pre-trained model during the fine-tuning process. Additionally, data used for fine-tuning should be carefully selected and representative to mitigate biases in the resulting model. Likewise, with SFT, the adapted model should be continuously evaluated to assess its performance, including monitoring for biases, fairness, and unintended consequences. Feedback mechanisms should be established to gather insights from users and stakeholders, enabling iterative improvements and addressing any ethical concerns that arise during the deployment of the fine-tuned model.
It is important to establish clear ownership and responsibility for the fine-tuning process, including monitoring and evaluating the impact of the adapted model to detect and address any unintended consequences or ethical issues.</p>
<h3 id="supervised-fine-tuning-sft-and-reinforcement-learning-from-human-feedback-rlhf">Supervised fine-tuning (SFT) and Reinforcement Learning from Human Feedback (RLHF)</h3>
<p>Supervised fine-tuning (SFT) and Reinforcement Learning from Human Feedback (RLHF) are different approaches to improving machine learning models. The main objective of SFT is to adapt a pre-trained model to a specific target task by further training it on a labeled dataset. The focus is on achieving high performance on the target task by leveraging knowledge from the pre-training. RLHF, on the other hand, aims to improve a model’s decision-making through interaction with human feedback, whether in the form of explicit rewards or evaluations. Moreover, SFT follows a supervised learning paradigm, where the model is trained on labeled examples with a well-defined loss function. It leverages the labeled data to update the model’s weights and typically requires a relatively large labeled dataset specific to the target task. The pre-trained model serves as a starting point and requires further training on this labeled data. RLHF follows a reinforcement learning paradigm, where the agent interacts with an environment and learns from feedback signals, typically in the form of rewards or evaluations. It employs exploration-exploitation strategies to optimize its policy. It can learn from sparse or even noisy feedback, as long as the feedback provides sufficient information for the agent to improve its decision-making. RLHF can potentially learn from a smaller amount of human feedback, which can be more easily obtained than large labeled datasets. In the context of LLMs, the RLHF framework is generic and can be used to optimize an LLM based on a variety of different objectives using a unified approach. RLHF can be used to transform generic, pre-trained LLMs into the impressive information-seeking dialogue agents that we commonly see today (e.g., ChatGPT). SFT, in contrast, is a method to adapt a pre-trained model to a specific target task by fine-tuning it on labeled data, with a focus on achieving high performance on the target task.</p>
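<p>The contrast between the two learning signals can be sketched in a hypothetical toy setting (not from the post): when choosing between two candidate responses, SFT receives a label naming the correct one and follows the cross-entropy gradient, while an RLHF-style learner observes only a scalar reward for the response it sampled and applies a REINFORCE-style, reward-weighted log-probability update:</p>

```python
import numpy as np

# Hypothetical toy contrast between SFT and RLHF-style updates when
# choosing between two candidate responses.

rng = np.random.default_rng(1)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# SFT-style update: a label names the correct response -> cross-entropy gradient.
logits_sft = np.zeros(2)
label = 1                                  # human-annotated correct response
for _ in range(50):
    p = softmax(logits_sft)
    grad = p - np.eye(2)[label]            # d(cross-entropy)/d(logits)
    logits_sft -= 0.5 * grad

# RLHF-style update: only a scalar reward for the sampled response is observed.
logits_rl = np.zeros(2)
reward = np.array([0.0, 1.0])              # human feedback rewards response 1
for _ in range(500):
    p = softmax(logits_rl)
    a = rng.choice(2, p=p)                 # sample a response from the policy
    grad_logp = np.eye(2)[a] - p           # d(log p(a))/d(logits)
    logits_rl += 0.5 * reward[a] * grad_logp   # REINFORCE: reward-weighted step
```

<p>Both learners end up preferring response 1, but the supervised update exploits the label directly, whereas the reward-driven update must discover the preferred response through sampling, which is why it needs more interaction steps here.</p>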
<h3 id="healthcare">Healthcare</h3>
<p>SFT finds valuable applications in the healthcare field, enhancing the performance of AI models across various domains. In medical image analysis, SFT enables the fine-tuning of pre-trained deep learning models for tasks like image segmentation, object detection, and classification. This approach allows models to learn specific medical features and patterns, leading to improved accuracy in disease diagnosis, abnormality detection, and treatment planning. Another area where SFT proves beneficial is the analysis of electronic health records (EHRs). By fine-tuning language models using labeled EHR data, SFT aids in extracting relevant information from unstructured clinical text, facilitating tasks such as identifying medical conditions, predicting patient outcomes, and supporting clinical decision-making. In the era of telemedicine and remote patient monitoring, SFT finds application in analyzing patient data collected through wearable devices, sensors, and remote monitoring systems. SFT could enable accurate detection of abnormalities, early warning signs, and personalized healthcare recommendations, enhancing remote patient care and enabling more effective telemedicine practices. These examples illustrate the broad potential of SFT in healthcare, where it assists in improving the accuracy, efficiency, and effectiveness of AI models. By leveraging supervised fine-tuning techniques, healthcare providers can harness the power of AI to support diagnostics, treatment planning, drug discovery, clinical decision-making, and remote patient monitoring.</p>
<h3 id="downsides">Downsides</h3>
<p>Although SFT offers valuable advantages, it is important to consider its potential downsides. Fine-tuning large-scale models can be computationally intensive, necessitating substantial computing infrastructure and time. Another limitation is the reliance on labeled data for the fine-tuning process. Obtaining high-quality labeled data can be time-consuming, costly, and resource-intensive. For instance, in healthcare, acquiring labeled data that accurately represents the complex and diverse nature of medical conditions and patient populations can be particularly challenging. Insufficient or biased labeled data may lead to suboptimal fine-tuning results and impact the generalizability of the model’s performance. Moreover, in the context of healthcare, the use of labeled data containing sensitive patient information raises concerns about privacy and data protection. Safeguarding patient privacy and complying with relevant data protection regulations is crucial to maintaining trust in healthcare AI applications. Additionally, potential bias in the labeled data used for fine-tuning can introduce ethical challenges. Biases in the data, such as disparities in healthcare access or underrepresentation of certain demographic groups, can be perpetuated and amplified by the fine-tuned models, leading to inequitable outcomes and exacerbating healthcare disparities.</p>
<h3 id="future-of-sft">Future of SFT</h3>
<p>SFT holds a promising future across various domains, with several key aspects shaping its potential in the coming years. Researchers are continually refining fine-tuning techniques, exploring novel architectures, and optimizing hyperparameters to achieve better results. As models become more sophisticated and datasets improve in size and quality, SFT will likely lead to even more accurate and effective AI systems. By leveraging pre-trained models as a foundation for fine-tuning, SFT enables the transfer of knowledge from one domain to another. Future advancements in transfer learning approaches will enable models to adapt more efficiently to new tasks and domains. This enhanced generalization capability will reduce the need for extensive retraining and accelerate the deployment of AI systems in real-world applications. Moreover, techniques such as few-shot learning, meta-learning, and active learning will reduce the reliance on large labeled datasets. This will expand the applicability of SFT to domains with limited labeled data availability, making AI models more practical and accessible. The future of SFT will also focus on interpretability and explainability: as models grow in complexity, the ability to interpret and explain their decisions becomes crucial. Researchers will develop techniques that provide transparent explanations for the behavior of fine-tuned models, making it possible to understand and validate the decisions made by the models and fostering trust and acceptance. Ensuring the robustness and safety of fine-tuned models remains a critical concern, and researchers will explore techniques to enhance the models’ resilience against adversarial attacks, spurious correlations, and unforeseen situations. Incorporating mechanisms for uncertainty estimation and risk assessment will contribute to the development of more reliable and secure AI systems.</p>
<h3 id="references">References</h3>
<ul>
<li><a href="https://medium.datadriveninvestor.com/lima-efficient-large-language-model-with-supervised-finetuning-bad42f7a48a6">Power of Supervised Finetuning with Open Source Large Language Models(LLMs)</a></li>
<li>Holstein, K., Wortman Vaughan, J., Daumé III, H., Dudík, M., & Wallach, H. (2019). Improving fairness in machine learning systems: What do industry practitioners need? In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems (pp. 1-16). doi: 10.1145/3290605.3300830</li>
<li>Mitchell, M., Wu, S., Zaldivar, A., Barnes, P., Vasserman, L., Hutchinson, B., … & Gebru, T. (2019). Model cards for model reporting. In Proceedings of the Conference on Fairness, Accountability, and Transparency (pp. 220-229). doi: 10.1145/3287560.3287596</li>
<li>Arrieta, A. B., Díaz-Rodríguez, N., Del Ser, J., Bennetot, A., Tabik, S., Barbado, A., … & Herrera, F. (2020). Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI. Information Fusion, 58, 82-115. doi: 10.1016/j.inffus.2019.12.012</li>
<li>Honegger, A., & Passweg, D. (2021). Supervised fine-tuning for controlled and responsible AI. arXiv preprint arXiv:2106.11539.</li>
<li>Jobin, A., Ienca, M., & Vayena, E. (2019). The global landscape of AI ethics guidelines. Nature Machine Intelligence, 1(9), 389-399. doi: 10.1038/s42256-019-0088-2</li>
<li>Campolo, A., Sanfilippo, M., Whittaker, M., & Crawford, K. (2017). AI Now 2017 Report. AI Now Institute at New York University. Retrieved from https://ainowinstitute.org/AI_Now_2017_Report.pdf</li>
<li>Floridi, L., & Cowls, J. (2019). A unified framework of five principles for AI in society. Harvard Data Science Review, 1(1). doi: 10.1162/99608f92.8cd550d1</li>
</ul>
Sun, 26 Nov 2023 00:00:00 +0000
/posts/2023-11-26/SFT.html
Artificial intelligenceResponsible AISFTpostsBayesian Gaussian Process Regression (GPR)<h3 id="gaussian-process-regression-gpr">Gaussian Process Regression (GPR)</h3>
<p>Gaussian process regression (GPR) is a non-parametric machine learning method that can be used to fit arbitrary scalar and vectorial quantities. GPR provides a probabilistic model that can be used to make predictions and estimate the uncertainty of those predictions. A Gaussian process is a generalization of the Gaussian probability distribution to functions, where any finite set of function values has a joint Gaussian distribution. The mean function and covariance function of the Gaussian process describe the prior distribution of the function, and the observations are used to update the prior to the posterior distribution of the function. In GPR, the output variable is assumed to be a function of the input variables, and the function is modeled as a sample from a Gaussian process. The goal is to predict the value of the output variable at a new input point, given the observed data. The predicted value is given by the posterior mean of the Gaussian process, and the uncertainty of the prediction is given by the posterior variance. GPR is particularly useful when the data is noisy or when the function being modeled is complex and nonlinear. The key advantages of GPR over other regression techniques are its flexibility and its ability to provide a probabilistic framework for uncertainty quantification. GPR can be used for both regression and classification problems, and it can handle both scalar and vector-valued outputs. Moreover, GPR can be easily extended to handle non-stationary and non-Gaussian data. In practice, GPR is often implemented using the kernlab or gpflow packages in R or Python, respectively. These packages provide functions for specifying the kernel function, which is used to model the covariance between the input variables, and for estimating the hyperparameters of the kernel function using maximum likelihood or Bayesian methods.</p>
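<p>The posterior mean and variance described above can be written out numerically. The sketch below is a NumPy illustration with an assumed squared-exponential kernel and a fixed noise level (the post’s own worked example below uses R via kernlab):</p>

```python
import numpy as np

# NumPy illustration of the GPR posterior, using an assumed
# squared-exponential kernel and a fixed noise standard deviation.

def rbf(a, b, ell=1.0, sf=1.0):
    """Squared-exponential covariance: sf^2 * exp(-(a-b)^2 / (2 ell^2))."""
    d = a[:, None] - b[None, :]
    return sf**2 * np.exp(-0.5 * (d / ell) ** 2)

rng = np.random.default_rng(123)
x = np.linspace(0, 10, 50)
y = np.sin(x) + rng.normal(0, 0.2, 50)        # same toy setup as the R example
xs = np.linspace(0, 10, 200)                  # test inputs

sigma_n = 0.2                                 # assumed noise standard deviation
K = rbf(x, x) + sigma_n**2 * np.eye(len(x))   # prior covariance plus noise
Ks = rbf(xs, x)                               # cross-covariance train/test
Kss = rbf(xs, xs)

L = np.linalg.cholesky(K)                     # stable solve via Cholesky
alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
mu = Ks @ alpha                               # posterior mean at test inputs
v = np.linalg.solve(L, Ks.T)
var = np.diag(Kss) - np.sum(v * v, axis=0)    # posterior variance at test inputs
```

<p>The posterior mean <code class="language-plaintext highlighter-rouge">mu</code> tracks the underlying sine function, while <code class="language-plaintext highlighter-rouge">var</code> shrinks below the prior variance near the observed points, which is exactly the uncertainty quantification the text describes.</p>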
<h3 id="overfitting">Overfitting</h3>
<p>Overfitting occurs when a model is too complex and fits the training data too closely, resulting in poor generalization to new data. Like some other machine learning techniques, GPR is prone to overfitting if the model is too complex relative to the amount of data available. Specifically, if the number of hyperparameters of the Gaussian process model is large, or if the covariance function is too flexible, the model may fit the noise in the data rather than the underlying signal. This can result in poor generalization performance, where the model performs well on the training data but poorly on new, unseen data. To mitigate the risk of overfitting in GPR, it is important to carefully select the kernel function and the hyperparameters of the model based on the available data. Cross-validation can be used to estimate the generalization error of the model and to select the optimal values of the hyperparameters. Regularization techniques, such as adding a prior distribution on the hyperparameters or using Bayesian model selection, can also be used to prevent overfitting. Another way to prevent overfitting in GPR is to use a simpler covariance function that captures the key features of the data, rather than trying to fit the noise in the data. Overall, while GPR is a powerful and flexible regression technique, it requires careful tuning of the hyperparameters and selection of the kernel function to prevent overfitting and achieve good generalization performance.</p>
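<p>One concrete guard against overfitting mentioned above is to compare hyperparameter settings by the log marginal likelihood, which trades data fit against model complexity. A hedged NumPy sketch (an assumed squared-exponential kernel with a fixed noise level; the function names are illustrative):</p>

```python
import numpy as np

# Comparing GPR length-scales by log marginal likelihood (hypothetical sketch,
# assumed squared-exponential kernel and fixed noise level).

def rbf(a, b, ell):
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / ell) ** 2)

def log_marginal_likelihood(x, y, ell, sigma_n=0.2):
    # log p(y | x, ell) = -0.5 y^T K^-1 y - 0.5 log|K| - (n/2) log 2*pi
    K = rbf(x, x, ell) + sigma_n**2 * np.eye(len(x))
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return (-0.5 * y @ alpha
            - np.log(np.diag(L)).sum()
            - 0.5 * len(x) * np.log(2 * np.pi))

rng = np.random.default_rng(123)
x = np.linspace(0, 10, 50)
y = np.sin(x) + rng.normal(0, 0.2, 50)

# A very short length-scale lets the GP chase the noise, a very long one
# underfits; the marginal likelihood penalizes both.
for ell in (0.05, 1.0, 5.0):
    print(ell, log_marginal_likelihood(x, y, ell))
```

<p>The moderate length-scale wins: the marginal likelihood automatically penalizes the extra flexibility of the wiggly model and the poor fit of the over-smooth one, which is the Bayesian model selection idea mentioned in the paragraph above.</p>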
<h3 id="healthcare">Healthcare</h3>
<p>In recent years, the field of healthcare has seen an increase in the use of machine learning techniques to improve patient care, optimize treatment plans, and enhance medical decision-making. Along these lines, GPR can find diverse applications in healthcare. GPR could offer several benefits in healthcare: for instance, it provides uncertainty quantification, allowing healthcare professionals to assess the reliability of predictions and make informed decisions. Moreover, GPR demonstrates flexibility and adaptability; its non-parametric nature enhances versatility in predictive modeling. Lastly, GPR showcases data efficiency, accurately predicting outcomes even with limited data points. This feature is particularly valuable in healthcare, where data collection can be challenging and costly, making GPR an optimal choice for applications with limited data availability.</p>
<p>Despite its numerous benefits, the use of GPR in healthcare comes with certain challenges and limitations that should be considered. Computational complexity poses a significant challenge, particularly with large datasets, necessitating efficient algorithms and computational resources to handle the complexity. Hyperparameter tuning is another consideration, involving the selection of optimal values for parameters such as the kernel function and noise level. This task can be challenging and may require expert knowledge or extensive experimentation. Furthermore, as GPR models complex relationships, the interpretability of the learned models can become intricate. Understanding the underlying factors contributing to predictions becomes more challenging in highly nonlinear models. These challenges highlight the need for careful consideration and expertise when applying GPR in healthcare settings. GPR’s ability to model complex relationships, estimate uncertainties, and provide interpretable predictions makes it an invaluable asset for predictive modeling in healthcare, with the potential to enhance disease progression modeling, personalize treatment plans, detect diseases early, and improve medical imaging analysis. While challenges exist, ongoing research and advancements in computational techniques are addressing these limitations, making GPR an increasingly valuable tool in healthcare. As the field continues to evolve, GPR is poised to revolutionize healthcare by enabling more accurate predictions, better decision-making, and improved patient outcomes.</p>
<h3 id="code">Code</h3>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Load necessary packages
library(kernlab)
library(GPfit)
library(ggplot2)
# Generate simulated data
set.seed(123)
x <- seq(0, 10, length = 50)
y <- sin(x) + rnorm(50, 0, 0.2)
df <- data.frame(x = x, y = y)
# Fit Gaussian process regression model
gpr_model <- gausspr(y ~ x, data = df)
y_pred <- predict(gpr_model, df)
# Visualize results
ggplot(df, aes(x = x, y = y)) +
geom_point() +
geom_line(aes(y = y_pred), color = "red") +
labs(title = "Gaussian Process Regression", x = "x", y = "y")
</code></pre></div></div>
<p>This R code performs Gaussian process regression (GPR) on simulated data and visualizes the results. Let’s break down each part of the code step-by-step:</p>
<ol>
<li>Load Necessary Packages:</li>
</ol>
<div class="language-R highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">library</span><span class="p">(</span><span class="n">kernlab</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">GPfit</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">ggplot2</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>
<p>This part loads the required R packages: <code class="language-plaintext highlighter-rouge">kernlab</code> for kernel-based machine learning, <code class="language-plaintext highlighter-rouge">GPfit</code> for Gaussian process modeling, and <code class="language-plaintext highlighter-rouge">ggplot2</code> for data visualization.</p>
<ol>
<li>Generate Simulated Data:</li>
</ol>
<div class="language-R highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">set.seed</span><span class="p">(</span><span class="m">123</span><span class="p">)</span><span class="w">
</span><span class="n">x</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">seq</span><span class="p">(</span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="m">10</span><span class="p">,</span><span class="w"> </span><span class="n">length</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">50</span><span class="p">)</span><span class="w">
</span><span class="n">y</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">sin</span><span class="p">(</span><span class="n">x</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">rnorm</span><span class="p">(</span><span class="m">50</span><span class="p">,</span><span class="w"> </span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="m">0.2</span><span class="p">)</span><span class="w">
</span><span class="n">df</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">data.frame</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">y</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>
<p>Simulated data is generated for the predictor variable <code class="language-plaintext highlighter-rouge">x</code> and the response variable <code class="language-plaintext highlighter-rouge">y</code>. The <code class="language-plaintext highlighter-rouge">x</code> values are generated as a sequence from 0 to 10 with 50 points. The <code class="language-plaintext highlighter-rouge">y</code> values are generated by taking the sine of each <code class="language-plaintext highlighter-rouge">x</code> value and adding random noise from a normal distribution with mean 0 and standard deviation 0.2. The data is then combined into a data frame <code class="language-plaintext highlighter-rouge">df</code>.</p>
<ol>
<li>Fit Gaussian Process Regression Model:</li>
</ol>
<div class="language-R highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">gpr_model</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">gausspr</span><span class="p">(</span><span class="n">y</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">df</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>
<p>A Gaussian process regression model is fitted using the <code class="language-plaintext highlighter-rouge">gausspr</code> function from the <code class="language-plaintext highlighter-rouge">kernlab</code> package. The model specification is <code class="language-plaintext highlighter-rouge">y ~ x</code>, indicating that we want to model <code class="language-plaintext highlighter-rouge">y</code> as a function of <code class="language-plaintext highlighter-rouge">x</code> using Gaussian process regression.</p>
<ol>
<li>Predict Values of y and Visualize Results:</li>
</ol>
<div class="language-R highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">y_pred</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">predict</span><span class="p">(</span><span class="n">gpr_model</span><span class="p">,</span><span class="w"> </span><span class="n">df</span><span class="p">)</span><span class="w">
</span><span class="n">ggplot</span><span class="p">(</span><span class="n">df</span><span class="p">,</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">y</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">geom_point</span><span class="p">()</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">geom_line</span><span class="p">(</span><span class="n">aes</span><span class="p">(</span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">y_pred</span><span class="p">),</span><span class="w"> </span><span class="n">color</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"red"</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">labs</span><span class="p">(</span><span class="n">title</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Gaussian Process Regression"</span><span class="p">,</span><span class="w"> </span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"x"</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"y"</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>
<p>This part predicts the values of the response variable <code class="language-plaintext highlighter-rouge">y_pred</code> for the predictor variable <code class="language-plaintext highlighter-rouge">x</code> using the fitted Gaussian process regression model. The <code class="language-plaintext highlighter-rouge">predict</code> function is used to make the predictions based on the model <code class="language-plaintext highlighter-rouge">gpr_model</code>.</p>
<p>The results are then visualized using <code class="language-plaintext highlighter-rouge">ggplot2</code>. A scatter plot of the original data points (<code class="language-plaintext highlighter-rouge">x</code> and <code class="language-plaintext highlighter-rouge">y</code>) is created with <code class="language-plaintext highlighter-rouge">geom_point()</code>. Overlaid on the scatter plot is a red line representing the predictions of the response variable (<code class="language-plaintext highlighter-rouge">y_pred</code>) from the Gaussian process regression model (<code class="language-plaintext highlighter-rouge">geom_line(aes(y = y_pred), color = "red")</code>).</p>
<h3 id="bayesian">Bayesian</h3>
<p>Gaussian process regression (GPR) can also be implemented in a Bayesian context using Stan. In Bayesian GPR, we assume a prior distribution for the unknown function and then update our beliefs about the function based on the observed data. The prior distribution is typically specified as a Gaussian process with a mean function and covariance function that depend on hyperparameters. The likelihood function for the observed data is also assumed to be Gaussian with a mean function equal to the prior mean function and a covariance function equal to the sum of the prior covariance function and a noise term. The hyperparameters of the prior and likelihood functions are estimated from the data using Markov chain Monte Carlo (MCMC) methods.</p>
<p>Here is an example of R code for fitting a Bayesian GPR model using Stan. Let’s break down each part of the code step-by-step:</p>
<ol>
<li>Generate Simulated Data:</li>
</ol>
<div class="language-R highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">set.seed</span><span class="p">(</span><span class="m">123</span><span class="p">)</span><span class="w">
</span><span class="n">x</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">seq</span><span class="p">(</span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="m">10</span><span class="p">,</span><span class="w"> </span><span class="n">length</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">50</span><span class="p">)</span><span class="w">
</span><span class="n">y</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">sin</span><span class="p">(</span><span class="n">x</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">rnorm</span><span class="p">(</span><span class="m">50</span><span class="p">,</span><span class="w"> </span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="m">0.2</span><span class="p">)</span><span class="w">
</span><span class="n">df</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">data.frame</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">y</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>
<p>Simulated data is generated for the predictor variable <code class="language-plaintext highlighter-rouge">x</code> and the response variable <code class="language-plaintext highlighter-rouge">y</code>. The <code class="language-plaintext highlighter-rouge">x</code> values are generated as a sequence from 0 to 10 with 50 points. The <code class="language-plaintext highlighter-rouge">y</code> values are generated by taking the sine of each <code class="language-plaintext highlighter-rouge">x</code> value and adding random noise from a normal distribution with mean 0 and standard deviation 0.2. The data is then combined into a data frame <code class="language-plaintext highlighter-rouge">df</code>.</p>
<ol>
<li>Specify Stan Model Code:</li>
</ol>
<div class="language-R highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">stan_model_code</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="s2">" ... "</span><span class="w">
</span></code></pre></div></div>
<p>The Stan model code is specified as a character string. The model defines the data, parameters, and the statistical model for Bayesian GPR. It uses a Gaussian process kernel to model the relationship between the predictor variable <code class="language-plaintext highlighter-rouge">x</code> and the response variable <code class="language-plaintext highlighter-rouge">y</code>. The parameters <code class="language-plaintext highlighter-rouge">mu</code>, <code class="language-plaintext highlighter-rouge">sigma_f</code>, <code class="language-plaintext highlighter-rouge">sigma_n</code>, and <code class="language-plaintext highlighter-rouge">eta</code> represent the mean function, the signal standard deviation of the covariance function for the underlying Gaussian process, the noise standard deviation, and the latent function values, respectively.</p>
<ol>
<li>Compile Stan Model:</li>
</ol>
<div class="language-R highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">gpr_stan_model</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">stan_model</span><span class="p">(</span><span class="n">model_code</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">stan_model_code</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>
<p>The Stan model is compiled using the <code class="language-plaintext highlighter-rouge">stan_model</code> function from the <code class="language-plaintext highlighter-rouge">rstan</code> package. This step converts the Stan model code into a C++ program that will be used for Bayesian inference.</p>
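<p>Compilation of the generated C++ can take a minute or more. Two commonly used <code class="language-plaintext highlighter-rouge">rstan</code> settings, shown here as a sketch, cache the compiled model on disk and run the MCMC chains in parallel:</p>

```r
library(rstan)

# Cache the compiled model so repeated runs skip recompilation
rstan_options(auto_write = TRUE)

# Run the MCMC chains in parallel, one per available core
options(mc.cores = parallel::detectCores())

gpr_stan_model <- stan_model(model_code = stan_model_code)
```

<p>These options only affect compilation caching and parallelism; they do not change the model or its results.</p>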
<ol>
<li>Prepare Data for Stan Model:</li>
</ol>
<div class="language-R highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">stan_data</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">list</span><span class="p">(</span><span class="n">x</span><span class="o">=</span><span class="n">df</span><span class="o">$</span><span class="n">x</span><span class="p">,</span><span class="w">
</span><span class="n">x2</span><span class="o">=</span><span class="n">df</span><span class="o">$</span><span class="n">x</span><span class="p">,</span><span class="w">
</span><span class="n">y</span><span class="o">=</span><span class="n">df</span><span class="o">$</span><span class="n">y</span><span class="p">,</span><span class="w">
</span><span class="n">N</span><span class="o">=</span><span class="nf">length</span><span class="p">(</span><span class="n">df</span><span class="o">$</span><span class="n">x</span><span class="p">),</span><span class="w">
</span><span class="n">N2</span><span class="o">=</span><span class="nf">length</span><span class="p">(</span><span class="n">df</span><span class="o">$</span><span class="n">x</span><span class="p">))</span><span class="w">
</span></code></pre></div></div>
<p>The data is prepared as a list <code class="language-plaintext highlighter-rouge">stan_data</code> containing the predictor variable <code class="language-plaintext highlighter-rouge">x</code>, the response variable <code class="language-plaintext highlighter-rouge">y</code>, the number of observations <code class="language-plaintext highlighter-rouge">N</code>, and the test inputs <code class="language-plaintext highlighter-rouge">x2</code> together with their count <code class="language-plaintext highlighter-rouge">N2</code>. Here the test inputs are simply the training inputs, so the latent function is predicted at the observed locations. This list is passed to the Stan model during sampling.</p>
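<p>Because the model accepts separate test inputs <code class="language-plaintext highlighter-rouge">x2</code>, predictions are not restricted to the training locations. A denser grid (a hypothetical 100-point grid in this sketch) yields a smoother plotted curve:</p>

```r
# Recreate the simulated data from the earlier step
set.seed(123)
x <- seq(0, 10, length.out = 50)
df <- data.frame(x = x, y = sin(x) + rnorm(50, 0, 0.2))

# Predict the latent function on a denser grid than the training inputs
x_pred <- seq(0, 10, length.out = 100)
stan_data <- list(x  = df$x,
                  y  = df$y,
                  N  = nrow(df),
                  x2 = x_pred,
                  N2 = length(x_pred))
```

<p>With this data list, <code class="language-plaintext highlighter-rouge">f</code> in the generated quantities block has length 100 rather than 50, one draw per prediction point.</p>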
<ol>
<li>Fit Bayesian GPR Model using Stan:</li>
</ol>
<div class="language-R highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">gpr_fit</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">sampling</span><span class="p">(</span><span class="n">gpr_stan_model</span><span class="p">,</span><span class="w"> </span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">stan_data</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>
<p>The Bayesian GPR model is fitted using the <code class="language-plaintext highlighter-rouge">sampling</code> function from <code class="language-plaintext highlighter-rouge">rstan</code>. This step performs Markov chain Monte Carlo (MCMC) sampling to estimate the posterior distribution of the model parameters.</p>
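<p>The call above relies on <code class="language-plaintext highlighter-rouge">sampling</code>'s defaults (4 chains of 2000 iterations, the first half discarded as warmup). Making these explicit, together with a seed, keeps runs reproducible; a quick <code class="language-plaintext highlighter-rouge">print</code> of the fit then summarizes the hyperparameter posteriors. A sketch:</p>

```r
gpr_fit <- sampling(gpr_stan_model, data = stan_data,
                    chains = 4, iter = 2000, warmup = 1000, seed = 123)

# Posterior summaries for the kernel hyperparameters and the noise scale
print(gpr_fit, pars = c("lengthscale_f", "sigma_f", "sigman"))
```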
<ol>
<li>Extract Posterior Samples of f for Prediction:</li>
</ol>
<div class="language-R highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">f_samples</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">extract</span><span class="p">(</span><span class="n">gpr_fit</span><span class="p">,</span><span class="w"> </span><span class="s2">"f"</span><span class="p">)</span><span class="o">$</span><span class="n">f</span><span class="w">
</span><span class="n">sigma_samples</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">extract</span><span class="p">(</span><span class="n">gpr_fit</span><span class="p">,</span><span class="w"> </span><span class="s2">"sigma"</span><span class="p">)</span><span class="o">$</span><span class="n">sigma</span><span class="w">
</span><span class="n">df</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">mutate</span><span class="p">(</span><span class="n">Ef</span><span class="o">=</span><span class="n">colMeans</span><span class="p">(</span><span class="n">f_samples</span><span class="p">),</span><span class="w">
</span><span class="n">sigma</span><span class="o">=</span><span class="n">mean</span><span class="p">(</span><span class="n">sigma_samples</span><span class="p">))</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">ggplot</span><span class="p">(</span><span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="o">=</span><span class="n">x</span><span class="p">,</span><span class="n">y</span><span class="o">=</span><span class="n">y</span><span class="p">))</span><span class="o">+</span><span class="w">
</span><span class="n">geom_point</span><span class="p">()</span><span class="o">+</span><span class="w">
</span><span class="n">labs</span><span class="p">(</span><span class="n">x</span><span class="o">=</span><span class="s2">"x"</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="o">=</span><span class="s2">"y"</span><span class="p">)</span><span class="o">+</span><span class="w">
</span><span class="n">geom_line</span><span class="p">(</span><span class="n">aes</span><span class="p">(</span><span class="n">y</span><span class="o">=</span><span class="n">Ef</span><span class="p">),</span><span class="w"> </span><span class="n">color</span><span class="o">=</span><span class="s1">'red'</span><span class="p">)</span><span class="o">+</span><span class="w">
</span><span class="n">geom_line</span><span class="p">(</span><span class="n">aes</span><span class="p">(</span><span class="n">y</span><span class="o">=</span><span class="n">Ef</span><span class="m">-2</span><span class="o">*</span><span class="n">sigma</span><span class="p">),</span><span class="w"> </span><span class="n">color</span><span class="o">=</span><span class="s1">'red'</span><span class="p">,</span><span class="n">linetype</span><span class="o">=</span><span class="s2">"dashed"</span><span class="p">)</span><span class="o">+</span><span class="w">
</span><span class="n">geom_line</span><span class="p">(</span><span class="n">aes</span><span class="p">(</span><span class="n">y</span><span class="o">=</span><span class="n">Ef</span><span class="m">+2</span><span class="o">*</span><span class="n">sigma</span><span class="p">),</span><span class="w"> </span><span class="n">color</span><span class="o">=</span><span class="s1">'red'</span><span class="p">,</span><span class="n">linetype</span><span class="o">=</span><span class="s2">"dashed"</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>
<p>The <code class="language-plaintext highlighter-rouge">extract</code> function pulls the posterior draws of the latent function <code class="language-plaintext highlighter-rouge">f</code> and of the noise standard deviation <code class="language-plaintext highlighter-rouge">sigma</code> from the fitted model. The posterior mean of <code class="language-plaintext highlighter-rouge">f</code> (computed with <code class="language-plaintext highlighter-rouge">colMeans</code> and stored as <code class="language-plaintext highlighter-rouge">Ef</code>) is drawn as a solid red line over the data, and the dashed lines at <code class="language-plaintext highlighter-rouge">Ef - 2*sigma</code> and <code class="language-plaintext highlighter-rouge">Ef + 2*sigma</code> give an approximate 95% predictive band. Note that this pipeline uses <code class="language-plaintext highlighter-rouge">mutate</code> and <code class="language-plaintext highlighter-rouge">%&gt;%</code>, so <code class="language-plaintext highlighter-rouge">dplyr</code> must be loaded alongside <code class="language-plaintext highlighter-rouge">rstan</code> and <code class="language-plaintext highlighter-rouge">ggplot2</code>.</p>
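<p>Beyond the mean-plus-or-minus-two-sigma band, pointwise credible intervals for the latent function itself can be read directly off the posterior draws with column-wise quantiles. The sketch below uses a stand-in draws matrix of the same shape as <code class="language-plaintext highlighter-rouge">f_samples</code> so it runs without a fitted model:</p>

```r
# Stand-in for f_samples <- extract(gpr_fit, "f")$f:
# rows are posterior draws, columns are prediction locations
set.seed(1)
f_samples <- matrix(rnorm(4000 * 50), nrow = 4000, ncol = 50)

# Pointwise posterior mean and 90% credible band for the latent function
f_mean  <- colMeans(f_samples)
f_lower <- apply(f_samples, 2, quantile, probs = 0.05)
f_upper <- apply(f_samples, 2, quantile, probs = 0.95)
```

<p>Unlike the <code class="language-plaintext highlighter-rouge">Ef - 2*sigma</code> band, this interval reflects uncertainty in the latent function only, without the observation noise.</p>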
<p>The code above demonstrates how to perform Bayesian GPR on simulated data, which is useful for modeling non-linear relationships between variables and making predictions with uncertainty estimates. Bayesian GPR provides a flexible framework for dealing with complex and noisy data, making it a powerful tool for various data analysis tasks.</p>
<div class="language-R highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">library</span><span class="p">(</span><span class="n">rstan</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">ggplot2</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">dplyr</span><span class="p">)</span><span class="w">
</span><span class="c1"># Generate simulated data</span><span class="w">
</span><span class="n">set.seed</span><span class="p">(</span><span class="m">123</span><span class="p">)</span><span class="w">
</span><span class="n">x</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">seq</span><span class="p">(</span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="m">10</span><span class="p">,</span><span class="w"> </span><span class="n">length</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">50</span><span class="p">)</span><span class="w">
</span><span class="n">y</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">sin</span><span class="p">(</span><span class="n">x</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">rnorm</span><span class="p">(</span><span class="m">50</span><span class="p">,</span><span class="w"> </span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="m">0.2</span><span class="p">)</span><span class="w">
</span><span class="n">df</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">data.frame</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">y</span><span class="p">)</span><span class="w">
</span><span class="c1"># Specify Stan model code</span><span class="w">
</span><span class="n">stan_model_code</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="s2">"
functions {
vector gp_pred_rng(array[] real x2,
vector y1,
array[] real x1,
real sigma_f,
real lengthscale_f,
real sigma,
real jitter) {
int N1 = rows(y1);
int N2 = size(x2);
vector[N2] f2;
{
matrix[N1, N1] L_K;
vector[N1] K_div_y1;
matrix[N1, N2] k_x1_x2;
matrix[N1, N2] v_pred;
vector[N2] f2_mu;
matrix[N2, N2] cov_f2;
matrix[N1, N1] K;
K = gp_exp_quad_cov(x1, sigma_f, lengthscale_f);
for (n in 1:N1)
K[n, n] = K[n,n] + square(sigma);
L_K = cholesky_decompose(K);
K_div_y1 = mdivide_left_tri_low(L_K, y1);
K_div_y1 = mdivide_right_tri_low(K_div_y1', L_K)';
k_x1_x2 = gp_exp_quad_cov(x1, x2, sigma_f, lengthscale_f);
f2_mu = (k_x1_x2' * K_div_y1);
v_pred = mdivide_left_tri_low(L_K, k_x1_x2);
cov_f2 = gp_exp_quad_cov(x2, sigma_f, lengthscale_f) - v_pred' * v_pred;
f2 = multi_normal_rng(f2_mu, add_diag(cov_f2, rep_vector(jitter, N2)));
}
return f2;
}
}
data {
int<lower=1> N; // number of observations
vector[N] x; // univariate covariate
vector[N] y; // target variable
int<lower=1> N2; // number of test points
vector[N2] x2; // univariate test points
}
transformed data {
// Normalize data
real xmean = mean(x);
real ymean = mean(y);
real xsd = sd(x);
real ysd = sd(y);
array[N] real xn = to_array_1d((x - xmean)/xsd);
array[N2] real x2n = to_array_1d((x2 - xmean)/xsd);
vector[N] yn = (y - ymean)/ysd;
real sigma_intercept = 1;
vector[N] zeros = rep_vector(0, N);
}
parameters {
real<lower=0> lengthscale_f; // lengthscale of f
real<lower=0> sigma_f; // scale of f
real<lower=0> sigman; // noise sigma
}
model {
// covariances and Cholesky decompositions
matrix[N, N] K_f = gp_exp_quad_cov(xn, sigma_f, lengthscale_f)+
sigma_intercept^2;
matrix[N, N] L_f = cholesky_decompose(add_diag(K_f, sigman^2));
// priors
lengthscale_f ~ normal(0, 1);
sigma_f ~ normal(0, 1);
sigman ~ normal(0, 1);
// model
yn ~ multi_normal_cholesky(zeros, L_f);
}
generated quantities {
// function scaled back to the original scale
vector[N2] f = gp_pred_rng(x2n, yn, xn, sigma_f, lengthscale_f, sigman, 1e-9)*ysd + ymean;
real sigma = sigman*ysd;
}
"</span><span class="w">
</span><span class="c1"># Compile Stan model</span><span class="w">
</span><span class="n">gpr_stan_model</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">stan_model</span><span class="p">(</span><span class="n">model_code</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">stan_model_code</span><span class="p">)</span><span class="w">
</span><span class="c1"># Prepare data for Stan model</span><span class="w">
</span><span class="n">stan_data</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">list</span><span class="p">(</span><span class="n">x</span><span class="o">=</span><span class="n">df</span><span class="o">$</span><span class="n">x</span><span class="p">,</span><span class="w">
</span><span class="n">x2</span><span class="o">=</span><span class="n">df</span><span class="o">$</span><span class="n">x</span><span class="p">,</span><span class="w">
</span><span class="n">y</span><span class="o">=</span><span class="n">df</span><span class="o">$</span><span class="n">y</span><span class="p">,</span><span class="w">
</span><span class="n">N</span><span class="o">=</span><span class="nf">length</span><span class="p">(</span><span class="n">df</span><span class="o">$</span><span class="n">x</span><span class="p">),</span><span class="w">
</span><span class="n">N2</span><span class="o">=</span><span class="nf">length</span><span class="p">(</span><span class="n">df</span><span class="o">$</span><span class="n">x</span><span class="p">))</span><span class="w">
</span><span class="c1"># Fit Bayesian GPR model using Stan</span><span class="w">
</span><span class="n">gpr_fit</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">sampling</span><span class="p">(</span><span class="n">gpr_stan_model</span><span class="p">,</span><span class="w"> </span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">stan_data</span><span class="p">)</span><span class="w">
</span><span class="n">f_samples</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">extract</span><span class="p">(</span><span class="n">gpr_fit</span><span class="p">,</span><span class="w"> </span><span class="s2">"f"</span><span class="p">)</span><span class="o">$</span><span class="n">f</span><span class="w">
</span><span class="n">sigma_samples</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">extract</span><span class="p">(</span><span class="n">gpr_fit</span><span class="p">,</span><span class="w"> </span><span class="s2">"sigma"</span><span class="p">)</span><span class="o">$</span><span class="n">sigma</span><span class="w">
</span><span class="n">df</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">mutate</span><span class="p">(</span><span class="n">Ef</span><span class="o">=</span><span class="n">colMeans</span><span class="p">(</span><span class="n">f_samples</span><span class="p">),</span><span class="w">
</span><span class="n">sigma</span><span class="o">=</span><span class="n">mean</span><span class="p">(</span><span class="n">sigma_samples</span><span class="p">))</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">ggplot</span><span class="p">(</span><span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="o">=</span><span class="n">x</span><span class="p">,</span><span class="n">y</span><span class="o">=</span><span class="n">y</span><span class="p">))</span><span class="o">+</span><span class="w">
</span><span class="n">geom_point</span><span class="p">()</span><span class="o">+</span><span class="w">
</span><span class="n">labs</span><span class="p">(</span><span class="n">x</span><span class="o">=</span><span class="s2">"x"</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="o">=</span><span class="s2">"y"</span><span class="p">)</span><span class="o">+</span><span class="w">
</span><span class="n">geom_line</span><span class="p">(</span><span class="n">aes</span><span class="p">(</span><span class="n">y</span><span class="o">=</span><span class="n">Ef</span><span class="p">),</span><span class="w"> </span><span class="n">color</span><span class="o">=</span><span class="s1">'red'</span><span class="p">)</span><span class="o">+</span><span class="w">
</span><span class="n">geom_line</span><span class="p">(</span><span class="n">aes</span><span class="p">(</span><span class="n">y</span><span class="o">=</span><span class="n">Ef</span><span class="m">-2</span><span class="o">*</span><span class="n">sigma</span><span class="p">),</span><span class="w"> </span><span class="n">color</span><span class="o">=</span><span class="s1">'red'</span><span class="p">,</span><span class="n">linetype</span><span class="o">=</span><span class="s2">"dashed"</span><span class="p">)</span><span class="o">+</span><span class="w">
</span><span class="n">geom_line</span><span class="p">(</span><span class="n">aes</span><span class="p">(</span><span class="n">y</span><span class="o">=</span><span class="n">Ef</span><span class="m">+2</span><span class="o">*</span><span class="n">sigma</span><span class="p">),</span><span class="w"> </span><span class="n">color</span><span class="o">=</span><span class="s1">'red'</span><span class="p">,</span><span class="n">linetype</span><span class="o">=</span><span class="s2">"dashed"</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>
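<p>Before trusting the posterior band, it is worth checking that the MCMC sampler converged. <code class="language-plaintext highlighter-rouge">rstan</code>'s standard diagnostics (split R-hat, effective sample size, traceplots) can be pulled from the fit as in this sketch:</p>

```r
# R-hat close to 1 and a healthy effective sample size indicate convergence
fit_summary <- summary(gpr_fit)$summary
fit_summary[c("lengthscale_f", "sigma_f", "sigman"), c("Rhat", "n_eff")]

# Visual check that the chains mix well and explore the same region
traceplot(gpr_fit, pars = c("lengthscale_f", "sigma_f", "sigman"))
```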
<h3 id="references">References</h3>
<ul>
<li>Duvenaud, D. K., Nickisch, H., & Rasmussen, C. E. (2013). Gaussian processes for machine learning: tutorial. In S. Sra, S. Nowozin, & S. J. Wright (Eds.), Optimization for Machine Learning (pp. 133-181). MIT Press.</li>
<li>Nguyen, T. D., & Nguyen, T. T. (2018). Multi-task Gaussian process models for biomedical applications. arXiv preprint arXiv:1806.03836.</li>
<li>Alaa, A. M., & van der Schaar, M. (2018). Prognostication and risk factors for cystic fibrosis via automated machine learning and Gaussian process regression. Scientific Reports, 8(1), 1-12.</li>
<li>Nguyen, T. T., Nguyen, H. T., Nguyen, T. L., & Chetty, G. (2017). Gaussian process regression for predicting 30-day readmission of heart failure patients. Journal of Biomedical Informatics, 71, 199-209.</li>
<li>Kazemi, S., & Soltanian-Zadeh, H. (2013). A new Gaussian process regression-based method for segmentation of brain tissues from MRI. Medical Image Analysis, 17(3), 225-234.</li>
<li><a href="https://avehtari.github.io/casestudies/Motorcycle/motorcycle_gpcourse.html">Gaussian process demonstration with Stan</a></li>
</ul>
<p>Sat, 28 Oct 2023 00:00:00 +0000 · /posts/2023-10-28/GPR.html · Tags: STAN, R, Bayes, posts</p>