# A Dirichlet Process Implementation in Julia

The following code closely follows the lecture notes and examples found here, which provide excellent reading on Dirichlet processes. This post is less about DPs themselves and more about what an implementation in Julia might look like. The example uses the "Traffic" dataset from the "MASS" library, whose description reads:

> An experiment was performed in Sweden in 1961-2 to assess the effect of a speed limit on the motorway accident rate. The experiment was conducted on 92 days in each year, matched so that day j in 1962 was comparable to day j in 1961. On some days the speed limit was in effect and enforced, while on other days there was no speed limit and cars tended to be driven faster. The speed limit days tended to be in contiguous blocks.

Let’s load the dataset into Julia and inspect histograms of the number of accidents each day when the speed limit is and is not enforced.

```julia
using RDatasets
using Distributions
using DataFrames
using Statistics
using IterTools   # successor to the old Iterators.jl package
using Gadfly

speed=dataset("MASS","Traffic")
limityes=speed[speed.Limit.=="yes",:Y]
nyes=length(limityes)
limitno=speed[speed.Limit.=="no",:Y]
nno=length(limitno)
plot(speed, x=:Y, color=:Limit, Geom.histogram(bincount=20,density=true))
```


A natural question to ask is: what is the posterior probability that there are fewer accidents on a day when the speed limit is enforced than on a day when it is not? To answer this, one could build a statistical model using a Dirichlet process.

# Dirichlet Process Model Specification

The model assumes each day's accident count $Y_i$ is drawn from a likelihood $g(y \mid \theta_i)$, with the $\theta_i$ drawn from a random distribution $G$ that itself has a Dirichlet process prior:

$$Y_i \mid \theta_i \sim g(\cdot \mid \theta_i), \qquad \theta_i \mid G \sim G, \qquad G \sim \mathrm{DP}(a, G_0).$$

Integrating out $G$ yields the Pólya-urn full conditional used for Gibbs sampling,

$$\theta_i \mid \theta_{-i}, Y \sim c \left( a\, m(Y_i)\, h(\theta_i \mid Y_i) + \sum_{j \neq i} g(Y_i \mid \theta_j)\, \delta_{\theta_j}(\theta_i) \right),$$

with $m(y) = \int g(y \mid \theta)\, dG_0(\theta)$ the marginal likelihood under the base measure $G_0$, $h(\theta \mid y)$ the corresponding posterior, and $c$ a normalizing constant so that the weights sum to one.
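As a concrete illustration of these weights, here is a sketch (toy numbers, with the Poisson likelihood and its negative binomial marginal used later in the post) of the full-conditional weights for a single observation:

```julia
using Distributions

# toy illustration of the Pólya-urn weights for one observation y_i
a  = 1.0                                    # concentration parameter
yi = 3                                      # the observation being updated
Θminus = [1.5, 2.2, 4.0]                    # θ values of the other observations
g(θ) = pdf(Poisson(θ), yi)                  # likelihood g(y_i | θ_j)
m = pdf(NegativeBinomial(2, 0.1/1.1), yi)   # marginal likelihood m(y_i)
w = [a * m; g.(Θminus)]                     # new-cluster weight first, then existing θ_j's
w /= sum(w)                                 # c normalizes so the weights sum to one
```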

A direct-scan Gibbs sampler can then be implemented as follows:

```julia
function randDP(loglikelihood,logmarginallikelihood,posterior,Y,a,Θin)
    Θ=deepcopy(Θin)
    for i = 1:length(Y)
        # likelihood of Y[i] under every other observation's θ ...
        weights = map(x -> loglikelihood(Y[i],x),Θ)
        # ... and the marginal likelihood for opening a new cluster
        weights[i] = logmarginallikelihood(Y[i])
        weights = exp.(weights .- maximum(weights))   # stabilized exponentiation
        weights[i] = weights[i]*a                     # scale by the concentration parameter
        weights = weights/sum(weights)
        select = rand(Categorical(weights))
        # either join an existing cluster or draw a fresh θ from the posterior
        Θ[i] = (select == i) ? rand(posterior(Y[i])) : Θ[select]
    end
    Θ
end
```
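Each sweep returns an updated Θ in which ties encode the clustering; a quick, self-contained way to inspect a draw is to count its distinct values:

```julia
# toy θ draw: repeated values indicate observations sharing a cluster
Θ = [1.3, 2.7, 1.3, 4.0, 2.7, 2.7]
nclusters = length(unique(Θ))                    # number of occupied clusters
sizes = [count(==(v), Θ) for v in unique(Θ)]     # observations per cluster
```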


For the Swedish speed-limit dataset, natural assumptions are that the likelihood $g$ is a Poisson pmf and the base measure is a $\mathrm{Gamma}(b_{1},b_{2})$ distribution (shape $b_1$, rate $b_2$), for which the marginal likelihood is negative binomial and the posterior is again a gamma.
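The Poisson–Gamma conjugacy gives the marginal likelihood in closed form as a negative binomial; the following sketch (toy hyperparameters, assuming Distributions.jl) checks this numerically by Monte Carlo. Note that Distributions.jl parametrizes the gamma by shape and *scale*, so rate $b_2$ enters as scale $1/b_2$:

```julia
using Distributions
using Statistics

b1, b2 = 2, 0.1
nb = NegativeBinomial(b1, b2/(b2+1))   # closed-form Poisson–Gamma marginal
θs = rand(Gamma(b1, 1/b2), 200_000)    # draws from the base measure G0
mc = mean(pdf.(Poisson.(θs), 3))       # Monte Carlo estimate of m(3)
println(mc, " ≈ ", pdf(nb, 3))
```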

# Gibbs Sampling

```julia
function partial(f,a...)
    (b...) -> f(a...,b...)
end
```
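The partial helper simply fixes leading arguments of a function. A tiny usage example (self-contained, with the helper repeated):

```julia
function partial(f,a...)
    (b...) -> f(a...,b...)
end

add(x, y, z) = x + y + z
add12 = partial(add, 1, 2)   # fixes the first two arguments
add12(3)                     # evaluates add(1, 2, 3)
```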

```julia
b1=2
b2=0.1
a=1

transition=partial(randDP,
    (y,θ)->logpdf(Poisson(θ),y),                    # Poisson likelihood
    y->logpdf(NegativeBinomial(b1,b2/(b2+1)),y),    # Poisson–Gamma marginal likelihood
    y->Gamma(b1+y,1/(b2+1)))                        # posterior; Distributions.jl uses
                                                    # shape/scale, so rate b2+1 enters
                                                    # as scale 1/(b2+1)

mcmcyes=iterated(partial(transition,limityes,a),
    ones(length(limityes)))

mcmcno=iterated(partial(transition,limitno,a),
    ones(length(limitno)))
```

```julia
drawsyes=collect(takenth(Iterators.take(Iterators.drop(mcmcyes,1000),10000),10))
drawsno=collect(takenth(Iterators.take(Iterators.drop(mcmcno,1000),10000),10))
distsyes=map( x -> MixtureModel([MixtureModel(map(y -> Poisson(y),x)),NegativeBinomial(b1,b2/(b2+1))],[nyes/(a+nyes), a/(a+nyes)]), drawsyes)
distsno=map( x -> MixtureModel([MixtureModel(map(y -> Poisson(y),x)),NegativeBinomial(b1,b2/(b2+1))],[nno/(a+nno), a/(a+nno)]), drawsno)
mean(x -> x[1]<x[2],zip(map(rand,distsyes),map(rand,distsno)))
```

```
0.603
```
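The posterior predictive built from a single draw mixes the Poisson components at the sampled θ's (weight $n/(a+n)$) with the prior predictive negative binomial for a brand-new cluster (weight $a/(a+n)$). A self-contained sketch with a made-up θ draw:

```julia
using Distributions

a, b1, b2 = 1.0, 2, 0.1
θ = [2.1, 2.1, 3.5, 2.8]                 # hypothetical single MCMC draw (n = 4)
n = length(θ)
pred = MixtureModel(
    [MixtureModel(Poisson.(θ)),          # existing clusters, equally weighted
     NegativeBinomial(b1, b2/(b2+1))],   # brand-new cluster
    [n/(a+n), a/(a+n)])                  # Pólya-urn mixing weights
pdf(pred, 3)                             # posterior predictive pmf at y = 3
```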


The iterated function from the IterTools.jl package forms a lazily evaluated sequence of MCMC iterates. In a functional style, the first 1000 iterates can be dropped as burn-in. One can then take 10000 iterates from the chain (albeit lazily) and keep every 10th using the takenth function. The collect function then forces evaluation of the lazy sequence. Each iterate is a $\theta$ array from which a posterior predictive distribution can be formed. Mapping a function that takes a $\theta$ array and returns a posterior predictive distribution over the MCMC draws therefore yields an array of posterior predictive distributions, computed in distsyes and distsno respectively. One can then generate a random variate from each element of these arrays in turn, zip them together, and apply a predicate that evaluates to true whenever the second element of a pair is greater than the first. The mean of this boolean array gives a Monte Carlo estimate of roughly 0.603 for the probability that the number of accidents on a day with no speed limit exceeds the number of accidents on a day with a speed limit.
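The drop/take/takenth pipeline can be seen in isolation with a toy deterministic "chain" (assuming IterTools.jl for iterated and takenth):

```julia
using IterTools

# toy chain: each "MCMC" step just adds one, starting from 0
chain = iterated(x -> x + 1, 0)   # lazy infinite sequence 0, 1, 2, ...
# drop 1000 as burn-in, take the next 100 iterates, keep every 10th
draws = collect(takenth(Iterators.take(Iterators.drop(chain, 1000), 100), 10))
```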

# Visualization

Here’s a function to plot the mean posterior predictive pmfs with 95% credible intervals.

```julia
function mcmcdataframe(draws,gitter,name)
    # the number of observations equals the length of each θ draw
    n=length(first(draws))
    dists=map( x -> MixtureModel([MixtureModel(map(y -> Poisson(y),x)),NegativeBinomial(b1,b2/(b2+1))],[n/(a+n), a/(a+n)]), draws)
    # rows index MCMC draws, columns index grid points
    gridmatrix=hcat(map(z -> pdf.(z,gitter),dists)...)'
    meanfunction=map(k -> mean(gridmatrix[:,k]),1:length(gitter))
    upperfunction=map(k -> quantile(gridmatrix[:,k],0.975),1:length(gitter))
    lowerfunction=map(k -> quantile(gridmatrix[:,k],0.025),1:length(gitter))
    DataFrame(
        x=gitter,
        y=meanfunction,
        ymin=lowerfunction,
        ymax=upperfunction,
        Speed_Limit=name
    )
end
```

```julia
df_mcmc=vcat(mcmcdataframe(drawsno,1:60,"No"),mcmcdataframe(drawsyes,1:60,"Yes"))
plot(df_mcmc, x=:x, y=:y, ymin=:ymin, ymax=:ymax, color=:Speed_Limit, Geom.line, Geom.ribbon)
```
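The ribbon in the plot comes from pointwise quantiles across MCMC draws; in isolation, the computation is just column-wise quantiles of a draws-by-grid matrix (toy numbers below):

```julia
using Statistics

# toy: 5 draws of a function evaluated at 3 grid points (rows = draws)
gridmatrix = [1.0 2.0 3.0;
              2.0 3.0 4.0;
              3.0 4.0 5.0;
              4.0 5.0 6.0;
              5.0 6.0 7.0]
meanband  = [mean(gridmatrix[:, k])            for k in 1:size(gridmatrix, 2)]
upperband = [quantile(gridmatrix[:, k], 0.975) for k in 1:size(gridmatrix, 2)]
lowerband = [quantile(gridmatrix[:, k], 0.025) for k in 1:size(gridmatrix, 2)]
```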

