
Python: MCMC (Markov chain Monte Carlo) with rejection sampling and the Metropolis-Hastings sampling algorithm


Link to the original text: http://tecdat.cn/?p=27267 (originally published on the author's official account).

We will look at two methods for sampling from a distribution: rejection sampling and Markov chain Monte Carlo (MCMC) using the Metropolis-Hastings algorithm. As usual, I will provide an intuitive explanation, the theory, and some examples with code.

Background

Before we get to the subject, let's break the term Markov chain Monte Carlo (MCMC) down into its basic components: the Monte Carlo method and Markov chains.

Markov chain

A Markov chain is "a random process that undergoes transitions from one state to another in a state space".

As you can see, it looks like a finite state machine where we simply annotate the state transitions with probabilities. For example, if it is sunny today, the probability of sun tomorrow is 0.9 and the probability of rain is 0.1; similarly, starting from a rainy day, there is a 0.5 chance of sun and a 0.5 chance of rain tomorrow. It should be clear that, from any given state, all outgoing transitions must sum to 1.0, because they form a proper probability distribution.

Another way to represent this information is through the transition matrix P:

P = [[0.9, 0.1],
     [0.5, 0.5]]

where the first row/column corresponds to the sunny state and the second to the rainy state.

The interesting thing about expressing it as a matrix is that we can simulate the Markov chain with matrix multiplication. For example, suppose we start from a sunny state; we can express this as the row vector x^(0) = [1 0]. This means our probability of being in the sunny state is 1 and our probability of being in the rainy state is 0. Now, if we perform the matrix multiplication, we find the probability of being in each state after one step:

x^(1) = x^(0) P = [0.9 0.1]

We can see there is a 0.9 chance of a sunny day (according to our simple model) and a 0.1 chance of rain. In fact, we can keep multiplying by the transition matrix to find the chance of sun/rain after k steps:

x^(k) = x^(0) P^k

We can easily calculate x^(k) for various values of k using numpy:

import numpy as np

def simulate_markov(x, P, steps):
    # Repeatedly apply the transition matrix, printing each state distribution
    for k in range(steps):
        print(f"x^({k}) = {np.round(x, 4)}")
        x = x @ P

P = np.array([[0.9, 0.1], [0.5, 0.5]])
simulate_markov(np.array([1.0, 0.0]), P, 10)   # start from the sunny state

We can see an interesting phenomenon here: as we take more steps in the state machine, the probabilities of sunny and rainy days seem to converge. You might think this is related to our initial state, but it is not. If we initialize the initial state to a random value, we get the same result:

r = np.random.rand()
simulate_markov(np.array([r, 1 - r]), P, 10)
x^(0) = [0.3653 0.6347]
x^(1) = [0.6461 0.3539]
x^(2) = [0.7584 0.2416]
x^(3) = [0.8034 0.1966]
x^(4) = [0.8214 0.1786]
x^(5) = [0.8285 0.1715]
x^(6) = [0.8314 0.1686]
x^(7) = [0.8326 0.1674]
x^(8) = [0.8330 0.1670]
x^(9) = [0.8332 0.1668]

This steady-state distribution is called the stationary distribution and is usually denoted π. The steady-state vector π can be found in several ways. The most direct is to take the limit of x^(k) as the number of steps approaches infinity.

The next method is to solve an equation. Since, by definition, π is the steady state, multiplying it by P must return the same vector:

π P = π,   i.e.   π (P − I) = 0

where I is the identity matrix. If you expand the vector/matrix notation, you will see that this is just a system of linear equations together with the constraint that π1, π2, ..., πn sum to 1 (i.e., that π forms a probability distribution). In our example there are only two states, so π1 + π2 = 1.
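
As a small sketch (reusing the transition matrix P from the code above), we can solve π(P − I) = 0 together with the sum-to-one constraint as a linear system:

import numpy as np

P = np.array([[0.9, 0.1], [0.5, 0.5]])
n = P.shape[0]

# Stack (P^T - I) with a row of ones (the sum-to-one constraint)
A = np.vstack([P.T - np.eye(n), np.ones(n)])
b = np.append(np.zeros(n), 1.0)

# Least-squares solution of the overdetermined system A @ pi = b
pi, *_ = np.linalg.lstsq(A, b, rcond=None)
print(pi)   # approximately [0.8333, 0.1667]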

However, not every Markov chain has a stationary distribution, let alone a unique one. But if we impose two additional constraints on the Markov chain, we can guarantee these properties:

  1. Irreducibility: it must be possible to eventually get from any state to any other state (in a finite number of steps).
  2. Aperiodicity: the system never returns to the same state with a fixed period (e.g., it does not return deterministically to "sunny" every 5 steps).

An important theorem states that if a Markov chain is ergodic, then it has a unique steady-state probability vector π. In the context of MCMC, we can jump from any state to any other state (with some finite probability), so irreducibility is easily satisfied.

Another useful pair of definitions we will need is detailed balance and reversible Markov chains. A Markov chain is said to be reversible (or to satisfy the detailed balance condition) if there exists a probability distribution π such that:

π(i) P(i→j) = π(j) P(j→i)

In other words, in the long run, the proportion of times you move from state i to state j is the same as the proportion of times you move from state j to state i. In fact, if a Markov chain is reversible, then we know it has a stationary distribution (which is why we used the same symbol π).
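
For a quick numerical illustration (reusing P from above, with the stationary vector π ≈ [5/6, 1/6] found earlier), we can check that the two-state weather chain satisfies detailed balance:

pi = np.array([5 / 6, 1 / 6])   # stationary distribution of the weather chain
# Detailed balance: pi[i] * P(i -> j) should equal pi[j] * P(j -> i)
print(np.isclose(pi[0] * P[0, 1], pi[1] * P[1, 0]))   # True for this chain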

Markov chain Monte Carlo method

Markov chain Monte Carlo (MCMC) methods are simply a class of algorithms that use Markov chains to sample from a particular probability distribution (the Monte Carlo part). They work by constructing a Markov chain whose limiting (stationary) distribution is exactly the distribution we want to sample from.

Here is a picture that may help describe the process. Imagine we are trying to build an MCMC sampler for an arbitrary one-dimensional distribution with PDF f(x). In this case, our states are points along the x axis, and our transition probabilities are the chances of going from one state to another. This is a simplified diagram of the situation:

The figure shows the density function we are trying to approximate as a thick black line, and a visualization of part of the Markov chain using blue lines for transitions out of the orange state. In particular, only the transitions from state x_0 to x_i for i = -3, -2, -1, 1, 2, 3 are drawn, but every point on the x axis is actually a potential state in this Markov chain. Note that this means we have an infinite state space, so we can no longer conveniently express the transitions as a matrix. The real "trick" of MCMC is to design the transition probabilities between states (points on the x axis) so that we spend most of our time in regions where f(x) is large and relatively little time in regions where it is small (precisely in proportion to our density function).

In terms of our figure, we want to spend most of our time around the center and less time in the tails. In fact, if we simulate our Markov chain long enough, the limiting distribution of its states should resemble the PDF we are trying to sample. Therefore, the basic algorithm for sampling with an MCMC method is:

  1. Start at an arbitrary point x.
  2. Jump to a point x′ with a certain transition probability (this may mean staying in the same state).
  3. Go back to step 2 until we have made T transitions.
  4. Record the current state x′ as a sample and continue with step 2.

Now, the proportion of time we spend at each point on the x axis should approximate the PDF we are trying to simulate; that is, if we draw a histogram of the recorded x values, we should get the same shape. A rough code sketch of this loop follows.
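
Here is that rough sketch of the loop; the transition function is only a placeholder at this point (choosing a concrete one is exactly what the Metropolis-Hastings algorithm below does):

def mcmc_samples(transition, x0, n_samples, T=50):
    # Generic MCMC loop: take T transitions, then record the current state
    samples, x = [], x0
    for _ in range(n_samples):
        for _ in range(T):
            x = transition(x)   # jump to x' with some transition probability
        samples.append(x)
    return samples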

Rejection sampling

Now, before we get to MCMC methods, I would like to introduce another method of sampling from a probability distribution that we will use later, called rejection sampling. The main idea is that if we want to sample from a distribution f(x), we use another, easier distribution g(x) to help us. The only restriction is that f(x) ≤ M g(x) for some M > 1. Its main use is when f(x) is difficult to sample from directly (but can still be evaluated at any point x).

The following is a breakdown of the algorithm:

  1. Sample x from g(x).
  2. Sample y from U(0, M g(x)) (a uniform distribution).
  3. If y < f(x), accept x as a sample from f(x); otherwise go back to step 1.

Another way to look at it is through the probability of keeping a sampled point x0. This is proportional to how often we draw x0 from g times how often we accept it, which is simply the ratio between f(x0) and M g(x0):

P(accept | x0) = f(x0) / (M g(x0))

This tells us that the probability of ending up with any given point is directly proportional to f(x0). After sampling many points and looking at the proportion of times we keep each x0, the constant M is normalized away and we recover the correct result: the PDF f(x).

Let's make this more intuitive with an example. The target distribution f(x) we want to sample from is the double gamma distribution, which is essentially a two-sided gamma distribution. We will use a normal distribution g(x) as our envelope distribution. The following code shows how to find the scaling constant M and illustrates how rejection sampling works conceptually.


import numpy as np
import pandas as pd
from scipy import stats

# Target: the double gamma distribution; envelope: a scaled normal distribution
# (the envelope parameters and plotting range here are illustrative)
dg = stats.dgamma(a=1)
g = stats.norm(loc=0, scale=2)
x = np.linspace(-8, 8, 500)
M = np.max(dg.pdf(x) / g.pdf(x))   # scaling constant so that f(x) <= M * g(x)

# Plot the target PDF and the scaled envelope
df = pd.DataFrame({'Target': dg.pdf(x), 'Envelope': M * g.pdf(x)}, index=x)
ax = df.plot(style=['--', '-'])

 

Looking at the picture, once we have drawn a sample from g(x) (in this case x = 2), we draw from a uniform distribution whose range is the height M g(x). If the draw falls within the height of the target PDF, we accept it (green); otherwise we reject it.

Implementing rejection sampling

The following code implements rejection sampling for our target double gamma distribution (using the helper sketched below). It plots the normalized histogram of the samples and compares it with the theoretical PDF we should obtain.
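
The helper itself is not shown in the original post, so here is a minimal sketch of what a rejection_sampling function could look like, assuming the target dg, envelope g, and scaling constant M defined above:

def rejection_sampling():
    # Keep proposing from the envelope g until a proposal is accepted
    while True:
        x0 = g.rvs()                               # 1. sample x from g(x)
        y = np.random.uniform(0, M * g.pdf(x0))    # 2. sample y from U(0, M*g(x))
        if y < dg.pdf(x0):                         # 3. accept x if y < f(x)
            return x0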


# Generate samples with the rejection sampling algorithm
samples = [rejection_sampling() for _ in range(10000)]

# Plot the normalized histogram of the samples against the target PDF
df['Target'].plot(legend=True)
pd.Series(samples).hist(bins=100, density=True, alpha=0.5)

 

Overall, our rejection sampler fits quite well. Taking more samples would further improve the fit to the theoretical distribution.

A big plus of rejection sampling is that it is easy to implement (just a few lines of Python), but it has one major drawback: it can be very slow, since many proposals end up being rejected.

Metropolis-Hastings Algorithm

The Metropolis-Hastings (MH) algorithm is an MCMC technique that draws samples from probability distributions that are hard to sample from directly. Compared with rejection sampling, the restrictions on MH are actually more relaxed: for a given probability density function p(x), we only require a function f(x) that is proportional to p(x)! This is extremely useful when sampling from posterior distributions.

Metropolis-Hastings Algorithm derivation

To derive the Metropolis-Hastings algorithm, let's start from the end goal: creating a Markov chain whose steady-state distribution equals our target distribution p(x). In terms of the Markov chain, we already know what the state space will be: the support of the probability distribution, i.e., the x values. Therefore (assuming we construct the Markov chain correctly), the steady-state distribution we end up with will be exactly p(x). What remains is to determine the probabilities of transitioning between these x values so that we achieve this steady-state behavior.

The detailed balance condition for the Markov chain, written in another way, is:

p(x) P(x→x′) = p(x′) P(x′→x)

Here p(x) is our target distribution and P(x→x′) is the probability of transitioning from point x to point x′. Our goal, then, is to determine the form of P(x→x′). Since we want to construct the Markov chain ourselves, let's use this detailed balance condition as the basis of the construction. Remember, detailed balance guarantees that our Markov chain has a stationary distribution (that it exists). In addition, if we also ensure ergodicity (no state is revisited with a fixed period, and every state can eventually reach every other state), we will have constructed a Markov chain with the unique stationary distribution p(x).

We can rearrange the equation to:

P(x→x′) / P(x′→x) = p(x′) / p(x) = f(x′) / f(x)

Here we use f(x) to denote a function that is merely proportional to p(x). This is to emphasize that we never need p(x) explicitly; anything proportional to it works, since the ratio cancels the normalizing constant. Now the "trick" is to break P(x→x′) into two separate steps: a proposal distribution g(x→x′) and an acceptance distribution A(x→x′) (similar in spirit to how rejection sampling works). Since they are independent, the transition probability is simply the product of the two:

P(x→x′) = g(x→x′) A(x→x′)

Here we must figure out the right choices for g(x→x′) and A(x→x′). Since g is the "proposal distribution", it determines the next point we might sample. The important thing is that it has the same support as our target p(x) (the ergodicity condition). A typical choice is a normal distribution centered on the current state. Now, given a fixed proposal distribution g, we want to find a matching A.

Although it is not obvious, a choice of A that satisfies this equation, i.e.

A(x→x′) / A(x′→x) = f(x′) g(x′→x) / (f(x) g(x→x′)),

is the following:

A(x→x′) = min(1, f(x′) g(x′→x) / (f(x) g(x→x′)))

To see this, consider the cases where the ratio f(x′)g(x′→x) / (f(x)g(x→x′)) is at most 1 and greater than 1. When it is at most 1, its reciprocal is at least 1, so the denominator of the left-hand side, A(x′→x), equals 1 and the numerator A(x→x′) equals the right-hand side. Conversely, when the ratio is greater than 1, the numerator A(x→x′) is 1 and the denominator A(x′→x) is the reciprocal of the right-hand side, so the left-hand side again equals the right-hand side.
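
As a quick numerical sanity check (the density and proposal values below are arbitrary illustrative numbers, not taken from the example distributions), we can verify that this acceptance rule makes f(x) g(x→x′) A(x→x′) equal f(x′) g(x′→x) A(x′→x):

f_x, f_xp = 0.2, 0.7        # unnormalized densities f(x) and f(x')
g_fwd, g_bwd = 0.3, 0.3     # g(x -> x') and g(x' -> x)
A_fwd = min(1, f_xp * g_bwd / (f_x * g_fwd))
A_bwd = min(1, f_x * g_fwd / (f_xp * g_bwd))
assert abs(f_x * g_fwd * A_fwd - f_xp * g_bwd * A_bwd) < 1e-12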

In this way, we have shown that the stationary distribution of the Markov chain we construct equals our target distribution p(x), because the detailed balance condition is satisfied by construction. The overall algorithm (which matches the generic MCMC algorithm above) is:

  1. Initialize the initial state by picking a random x.
  2. Propose a new point x′ according to g(x→x′).
  3. Accept x′ with probability A(x→x′) (by comparing against a uniform random number). If accepted, transition to x′; otherwise stay in state x.
  4. Repeat step 2, T times.
  5. Save the current state x as a sample and go back to step 2 to sample another point.

Burn-in and correlated samples

Before we move on, we need to discuss two very important practical issues with MCMC methods. The first concerns the initial state we choose. Since we pick the value of x at random, it may well lie in a region where p(x) is very small (think of the tails of our double gamma distribution). If we start there, we may spend a disproportionate amount of time traversing low-density x values, giving the false impression that those x values occur more often than they actually do. The solution is to "burn in" the sampler: generate a batch of initial samples and throw them away. How many depends on the details of the distribution we are trying to simulate.

The second issue, mentioned above, is the correlation between adjacent samples. Because, by the definition of our transition function P(x→x′), the draw of x′ depends on the current state x, we lose an important property of the samples: independence. To correct for this, we only record every T-th sample (thinning). Assuming T is large enough, the recorded samples should be relatively independent. As with burn-in, the appropriate value of T depends on the target and proposal distributions.
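
As a tiny illustration (the burn-in length and thinning interval here are arbitrary, and raw_samples stands for the full stream of recorded states):

burn_in, T = 1000, 10
samples = raw_samples[burn_in::T]   # drop the burn-in, then keep every T-th draw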

Implementing the Metropolis-Hastings algorithm

Let's reuse the double gamma distribution example from above. We define our proposal distribution as a normal distribution N(x, 2) centered on the current state x with a standard deviation of 2 (remember, x is the current state):

Given that f(x) is proportional to our underlying distribution p(x), our acceptance distribution becomes:

A(x→x′) = min(1, f(x′) / f(x))

Because the normal proposal is symmetric, the proposal PDFs evaluated at the respective points cancel each other out. Now let's look at some code (the sampler helper is sketched below):
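
The sampler object used below is not shown in the original post, so here is a minimal generator sketch of a Metropolis-Hastings sampler with a symmetric normal proposal; the name mh_sampler, the starting point, and the burn-in/thinning defaults are assumptions for illustration:

def mh_sampler(f, proposal_std=2, burn_in=1000, T=10):
    # Metropolis-Hastings with the symmetric proposal N(x, proposal_std)
    x, step = 0.0, 0
    while True:
        x_new = np.random.normal(x, proposal_std)
        # Symmetric proposal, so the acceptance probability is min(1, f(x')/f(x))
        if np.random.uniform() < min(1.0, f(x_new) / f(x)):
            x = x_new
        step += 1
        if step > burn_in and step % T == 0:
            yield x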


# A function merely proportional to the double gamma PDF f(x)
f = lambda x: dg.pdf(x) * np.pi   # any positive constant multiple works

sampler = mh_sampler(f)

# Draw samples
samples = [next(sampler) for _ in range(10000)]

# Plot the normalized histogram of the samples against the target PDF
df['Target'].plot(legend=True)
pd.Series(samples).hist(bins=100, density=True, alpha=0.5)

 

The samples from our MH sampler approximate the double gamma distribution well. Moreover, looking at the autocorrelation plot, the autocorrelation is small across the whole sample, indicating the samples are relatively independent. If we had not chosen a good value for T, or had skipped the burn-in period, we would see larger values in that plot.

Conclusion

I hope you enjoyed this short article on sampling with rejection sampling and MCMC using the Metropolis-Hastings algorithm.



