机器学习 —— 概率图模型(Homework: StructuredCPD)

  Week2的作业主要是关于概率图模型的构造,主要任务可以分为两个部分:1、构造CPD;2、构造Graph。对于有向图而言,在获得单个节点的CPD之后就可依据图对Combine CPD进行构造。在获得Combine CPD之后则可利用变量的观测来进行问答。此周作业的大背景是对基因型与表现型之间的关系进行探索。在已知表现性的情况下对基因型以及下一代的基因进行推测。这是一个很有实际意义的有向图网络。

1、CPD构造

1.1、基因型与表现型的关系——确定

  在孟德尔遗传假说基础上,对双碱基配对的基因推测表现型与基因型的关系。也就是说,表现型是一个var,基因型是一个var,求factor. 由于表现型被显性基因或者隐性基因支配,所以显性,隐性与基因型之间是0,1关系。这是一种很特殊的CPD.

 function phenotypeFactor = phenotypeGivenGenotypeMendelianFactor(isDominant, genotypeVar, phenotypeVar)
% This function computes the probability of each phenotype given the
% different genotypes for a trait. It assumes that there is 1 dominant
% allele and 1 recessive allele.
%
% If you do not have much background in genetics, you should read the
% on-line Appendix or watch the Khan Academy Introduction to Heredity Video
% (http://www.khanacademy.org/video/introduction-to-heredity?playlist=Biology)
% before you start coding.
%
% For the genotypes, assignment 1 maps to homozygous dominant, assignment 2
% maps to heterozygous, and assignment 3 maps to homozygous recessive. For
% example, say that there is a gene with two alleles, dominant allele A and
% recessive allele a. Assignment 1 would map to AA, assignment 2 would
% make to Aa, and assignment 3 would map to aa. For the phenotypes,
% assignment 1 maps to having the physical trait, and assignment 2 maps to
% not having the physical trait.
%
% THE VARIABLE TO THE LEFT OF THE CONDITIONING BAR MUST BE THE FIRST
% VARIABLE IN THE .var FIELD FOR GRADING PURPOSES
%
% Input:
% isDominant: 1 if the trait is caused by the dominant allele (trait
% would be caused by A in example above) and 0 if the trait is caused by
% the recessive allele (trait would be caused by a in the example above)
% genotypeVar: The variable number for the genotype variable (goes in the
% .var part of the factor)
% phenotypeVar: The variable number for the phenotype variable (goes in
% the .var part of the factor)
%
% Output:
% phenotypeFactor: Factor denoting the probability of having each
% phenotype for each genotype phenotypeFactor = struct('var', [], 'card', [], 'val', []); %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%INSERT YOUR CODE HERE
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% % Fill in phenotypeFactor.var. This should be a 1-D row vector.
% Fill in phenotypeFactor.card. This should be a 1-D row vector.
phenotypeFactor.var = [phenotypeVar genotypeVar];
phenotypeFactor.card= [2 3]; array = IndexToAssignment(1:prod(phenotypeFactor.card),phenotypeFactor.card); if isDominant
phenotypeFactor.val = ones(1, prod(phenotypeFactor.card));
for i = 1:length(array)
if array(i,2)==3&&array(i,1)==1
phenotypeFactor.val(i)=0;
elseif ismember(array(i,2),[1 2])&&array(i,1)==2
phenotypeFactor.val(i)=0;
end
end
else
phenotypeFactor.val = zeros(1, prod(phenotypeFactor.card));
for i = 1:length(array)
if array(i,2)==3&&array(i,1)==1
phenotypeFactor.val(i)=1;
end
end
end
end % Replace the zeros in phentoypeFactor.val with the correct values. %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

1.2 基因型与表现型的关系——不确定

  这可认为是非孟德尔遗传。也就是说,对于给定的基因型,表现型不一定出现,基因只影响致病概率。

function phenotypeFactor = phenotypeGivenGenotypeFactor(alphaList, genotypeVar, phenotypeVar)
% This function computes the probability of each phenotype given the
% different genotypes for a trait. Genotypes (assignments to the genotype
% variable) are indexed from 1 to the number of genotypes, and the alphas
% are provided in the same order as the corresponding genotypes so that the
% alpha for genotype assignment i is alphaList(i).
%
% For the phenotypes, assignment 1 maps to having the physical trait, and
% assignment 2 maps to not having the physical trait.
%
% THE VARIABLE TO THE LEFT OF THE CONDITIONING BAR MUST BE THE FIRST
% VARIABLE IN THE .var FIELD FOR GRADING PURPOSES
%
% Input:
% alphaList: Vector of alpha values for each genotype (n x 1 vector,
% where n is the number of genotypes) -- the alpha value for a genotype
% is the probability that a person with that genotype will have the
% physical trait
% genotypeVar: The variable number for the genotype variable (goes in the
% .var part of the factor)
% phenotypeVar: The variable number for the phenotype variable (goes in
% the .var part of the factor)
%
% Output:
% phenotypeFactor: Factor in which the val has the probability of having
% each phenotype for each genotype combination (note that this is the
% FULL CPD with no evidence observed) phenotypeFactor = struct('var', [], 'card', [], 'val', []); %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%INSERT YOUR CODE HERE
% The number of genotypes is the length of alphaList.
% The number of phenotypes is 2.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% % Fill in phenotypeFactor.var. This should be a 1-D row vector.
% Fill in phenotypeFactor.card. This should be a 1-D row vector.
phenotypeFactor.var = [phenotypeVar genotypeVar];
phenotypeFactor.card= [2 3];
phenotypeFactor.val = zeros(1, prod(phenotypeFactor.card));
% Replace the zeros in phentoypeFactor.val with the correct values.
assign_ = IndexToAssignment(1:prod(phenotypeFactor.card),[2 3]); for i = 1:prod(phenotypeFactor.card)
if assign_(i,1) == 1
switch assign_(i,2)
case 1
phenotypeFactor.val(i) = alphaList(1);
case 2
phenotypeFactor.val(i) = alphaList(2);
case 3
phenotypeFactor.val(i) = alphaList(3);
otherwise
'mistake in genotype'
return
end
else
switch assign_(i,2)
case 1
phenotypeFactor.val(i) = 1-alphaList(1);
case 2
phenotypeFactor.val(i) = 1-alphaList(2);
case 3
phenotypeFactor.val(i) = 1-alphaList(3);
otherwise
'mistake in genotype'
return
end
end
end
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

1.3 表现型的分布

  表现型由基因决定,而基因是由统计决定的。也就是说,只观测到某个基因出现频率,要推测人群中表现型的概率

 function genotypeFactor = genotypeGivenAlleleFreqsFactor(alleleFreqs, genotypeVar)

 % alleleFreqs = [0.1; 0.9];
% genotypeVar = 1; % This function computes the probability of each genotype given the allele
% frequencies in the population. % Note that we assume that the copies of the gene are independent. Thus,
% knowing the allele for one copy of the gene does not affect the
% probability of having each allele for the other copy of the gene. As a
% result, the probability of a genotype is the product of the frequencies
% of its constituent alleles (or twice that product for heterozygous
% genotypes). % Input:
% alleleFreqs: An n x 1 vector of the frequencies of the alleles in the
% population, where n is the number of alleles
% genotypeVar: The variable number for the genotype (goes in the .var
% part of the factor)
%
% Output:
% genotypeFactor: Factor in which the val has the probability of having
% each genotype (note that this is the FULL CPD with no evidence
% observed) % The number of genotypes is (number of alleles choose 2) + number of
% alleles -- need to add number of alleles at the end to account for
% homozygotes genotypeFactor = struct('var', [], 'card', [], 'val', []);
numAlleles = length(alleleFreqs); % Each allele has an ID that is the index of its allele frequency in the
% allele frequency list. Each genotype also has an ID. We need allele and
% genotype IDs so that we know what genotype and alleles correspond to each
% probability in the .val part of the factor. For example, the first entry
% in .val corresponds to the probability of having the genotype with
% genotype ID 1, which consists of having two copies of the allele with
% allele ID 1. There is a mapping from a pair of allele IDs to genotype
% IDs and from genotype IDs to a pair of allele IDs below; we compute this
% mapping using generateAlleleGenotypeMappers(numAlleles). (A genotype
% consists of 2 alleles.) [allelesToGenotypes, genotypesToAlleles] = generateAlleleGenotypeMappers(numAlleles); % One or both of these matrices might be useful.
%
% 1. allelesToGenotypes: n x n matrix that maps pairs of allele IDs to
% genotype IDs, where n is the number of alleles -- if
% allelesToGenotypes(i, j) = k, then the genotype with ID k comprises of
% the alleles with IDs i and j
%
% 2. genotypesToAlleles: m x 2 matrix of allele IDs, where m is the
% number of genotypes -- if genotypesToAlleles(k, :) = [i, j], then the
% genotype with ID k is comprised of the allele with ID i and the allele
% with ID j %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%INSERT YOUR CODE HERE
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% % Fill in genotypeFactor.var. This should be a 1-D row vector.
% Fill in genotypeFactor.card. This should be a 1-D row vector.
genotypeFactor.var = genotypeVar;
genotypeFactor.card = length(genotypesToAlleles(:,1)); genotypeFactor.val = zeros(1, prod(genotypeFactor.card));
% Replace the zeros in genotypeFactor.val with the correct values.
n_ = length(alleleFreqs);
for i = 1:n_
for j = 1:n_
num_of_val = allelesToGenotypes(i,j);
genotypeFactor.val(num_of_val) = genotypeFactor.val(num_of_val)+alleleFreqs(i)*alleleFreqs(j) ;
end
end %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

1.4 父辈基因型与子辈基因型的关系

  每个基因拥有两个碱基,两位父辈对子辈进行等位遗传。也就是每个碱基遗传给子类的概率是相同的。那么父辈基因——子辈基因的关系则如下所示:

 function genotypeFactor = genotypeGivenParentsGenotypesFactor(numAlleles, genotypeVarChild, genotypeVarParentOne, genotypeVarParentTwo)

 % This function computes a factor representing the CPD for the genotype of
% a child given the parents' genotypes. % THE VARIABLE TO THE LEFT OF THE CONDITIONING BAR MUST BE THE FIRST
% VARIABLE IN THE .var FIELD FOR GRADING PURPOSES % When writing this function, make sure to consider all possible genotypes
% from both parents and all possible genotypes for the child. % Input:
% numAlleles: int that is the number of alleles
% genotypeVarChild: Variable number corresponding to the variable for the
% child's genotype (goes in the .var part of the factor)
% genotypeVarParentOne: Variable number corresponding to the variable for
% the first parent's genotype (goes in the .var part of the factor)
% genotypeVarParentTwo: Variable number corresponding to the variable for
% the second parent's genotype (goes in the .var part of the factor)
%
% Output:
% genotypeFactor: Factor in which val is probability of the child having
% each genotype (note that this is the FULL CPD with no evidence
% observed) % The number of genotypes is (number of alleles choose 2) + number of
% alleles -- need to add number of alleles at the end to account for homozygotes genotypeFactor = struct('var', [], 'card', [], 'val', []); % Each allele has an ID. Each genotype also has an ID. We need allele and
% genotype IDs so that we know what genotype and alleles correspond to each
% probability in the .val part of the factor. For example, the first entry
% in .val corresponds to the probability of having the genotype with
% genotype ID 1, which consists of having two copies of the allele with
% allele ID 1, given that both parents also have the genotype with genotype
% ID 1. There is a mapping from a pair of allele IDs to genotype IDs and
% from genotype IDs to a pair of allele IDs below; we compute this mapping
% using generateAlleleGenotypeMappers(numAlleles). (A genotype consists of
% 2 alleles.) [allelesToGenotypes, genotypesToAlleles] = generateAlleleGenotypeMappers(numAlleles); % One or both of these matrices might be useful.
%
% 1. allelesToGenotypes: n x n matrix that maps pairs of allele IDs to
% genotype IDs, where n is the number of alleles -- if
% allelesToGenotypes(i, j) = k, then the genotype with ID k comprises of
% the alleles with IDs i and j
%
% 2. genotypesToAlleles: m x 2 matrix of allele IDs, where m is the
% number of genotypes -- if genotypesToAlleles(k, :) = [i, j], then the
% genotype with ID k is comprised of the allele with ID i and the allele
% with ID j %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%INSERT YOUR CODE HERE
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% % Fill in genotypeFactor.var. This should be a 1-D row vector.
% Fill in genotypeFactor.card. This should be a 1-D row vector. genotypeFactor.var = [genotypeVarChild, genotypeVarParentOne, genotypeVarParentTwo];
var_num_ = length(genotypeFactor.var);
var_length_ = length(genotypesToAlleles(:,1));
genotypeFactor.card = linspace(var_length_,var_length_,var_num_);
genotypeFactor.val = zeros(1, prod(genotypeFactor.card));
assign_ = IndexToAssignment(1:prod(genotypeFactor.card),genotypeFactor.card);
% Replace the zeros in genotypeFactor.val with the correct values.
for i = 1:length(assign_)
GeType_child = assign_(i,1);
GeType_par1 = assign_(i,2);
GeType_par2 = assign_(i,3);
AllType_child = genotypesToAlleles(GeType_child,:);
AllType_par1 = genotypesToAlleles(GeType_par1,:);
AllType_par2 = genotypesToAlleles(GeType_par2,:);
combination = [];
length_tmp_allpar = length(AllType_par1);
for j = 1:length_tmp_allpar
for k = 1:length_tmp_allpar
combination = [combination;AllType_par1(j) AllType_par2(k)];
end
end
num_count = 0;
length_tmp_combination = length(combination);
for j = 1:length_tmp_combination
combin_row = combination(j,:);
if sort(AllType_child) == sort(combin_row)
num_count = num_count+1;
end
end
genotypeFactor.val(i)= num_count/length_tmp_combination;
end
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

1.5 表现型与碱基对之间的关系

  从生物上来讲,碱基对和基因说的是同一个东西。但是在概率图中,有两种建模方式:1、基因是var ;2、碱基是var。 如果基因是var,那么表现型(var)则只由一个var决定,如果碱基是var,则表现型由两个var决定。如果使用碱基作为var来对图进行建模,可以大幅降低CPD的复杂程度。如果每个碱基有n种值,那么基因的CPD会达到恐怖的n!种取值。但是如果使用两个碱基进行描述,则CPD的取值是n+n。从而达到解耦的目的。

  

  function phenotypeFactor = phenotypeGivenCopiesFactor(alphaList, numAlleles, geneCopyVarOne, geneCopyVarTwo, phenotypeVar)

 % alphaList = [0.8; 0.6; 0.1; 0.5; 0.05; 0.01];
% numAlleles = 3;
% geneCopyVarOne= 1;
% geneCopyVarTwo = 2;
% phenotypeVar = 3; % This function makes a factor whose values are the probabilities of
% a phenotype given an allele combination. Note that a person has one
% copy of the gene from each parent. % In the factor, each assignment maps to the allele at the corresponding
% location on the allele list, so allele assignment 1 maps to
% alleleList{1}, allele assignment 2 maps to alleleList{2}, .... For the
% phenotypes, assignment 1 maps to having the physical trait, and
% assignment 2 maps to not having the physical trait. % THE VARIABLE TO THE LEFT OF THE CONDITIONING BAR MUST BE THE FIRST
% VARIABLE IN THE .var FIELD FOR GRADING PURPOSES % Input:
% alphaList: m x 1 vector of alpha values for the different genotypes,
% where m is the number of genotypes -- the alpha value for a genotype
% is the probability that a person with that genotype will have the
% physical trait
% numAlleles: int that is the number of alleles
% geneCopyVarOne: Variable number corresponding to the variable for
% the first copy of the gene (goes in the .var part of the factor)
% geneCopyVarTwo: Variable number corresponding to the variable for
% the second copy of the gene (goes in the .var part of the factor)
% phenotypeVar: Variable number corresponding to the variable for the
% phenotype (goes in the .var part of the factor)
%
% Output:
% phenotypeFactor: Factor in which the values are the probabilities of
% having each phenotype for each allele combination (note that this is
% the FULL CPD with no evidence observed) phenotypeFactor = struct('var', [], 'card', [], 'val', []); % Each allele has an ID. Each genotype also has an ID, which is the index
% of its alpha in the list of alphas. There is a mapping from a pair of
% allele IDs to genotype IDs and from genotype IDs to a pair of allele IDs
% below; we compute this mapping using
% generateAlleleGenotypeMappers(numAlleles). (A genotype consists of 2
% alleles.) [allelesToGenotypes, genotypesToAlleles] = generateAlleleGenotypeMappers(numAlleles); % One or both of these matrices might be useful.
%
% 1. allelesToGenotypes: n x n matrix that maps pairs of allele IDs to
% genotype IDs, where n is the number of alleles -- if
% allelesToGenotypes(i, j) = k, then the genotype with ID k comprises of
% the alleles with IDs i and j
%
% 2. genotypesToAlleles: m x 2 matrix of allele IDs, where m is the
% number of genotypes -- if genotypesToAlleles(k, :) = [i, j], then the
% genotype with ID k is comprised of the allele with ID i and the allele
% with ID j %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%INSERT YOUR CODE HERE
% The number of phenotypes is 2
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% % Fill in phenotypeFactor.var. This should be a 1-D row vector.
% Fill in phenotypeFactor.card. This should be a 1-D row vector.
phenotypeFactor.var = [phenotypeVar geneCopyVarOne geneCopyVarTwo];
phenotypeFactor.card= [2 numAlleles numAlleles]; phenotypeFactor.val = zeros(1, prod(phenotypeFactor.card));
% Replace the zeros in phentoypeFactor.val with the correct values.
assign_alle = IndexToAssignment(1:prod(phenotypeFactor.card),phenotypeFactor.card); for i = 1:prod(phenotypeFactor.card)
alle_type = assign_alle(i,2:3);
if assign_alle(i,1) == 1
phenotypeFactor.val(i) = alphaList(allelesToGenotypes(alle_type(1),alle_type(2)));
else
phenotypeFactor.val(i) = 1 - alphaList(allelesToGenotypes(alle_type(1),alle_type(2)));
end
end
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

2. 构造Graph

2.1 通常的Graph

  在有了CPD之后,只需要把CPD串起来,就可以得到图模型了。CPD描述的是变量与变量之间的关系,对于基因---表现型而言,这种关系是“一般的”。也就是这种关系放在任意一个普系中都适用。构造Graph的函数应该返回很多的局部CPD。也就是变量与其父节点构成的CPD。在得到这些CPD之后,只需要按照贝耶斯网络规律连乘,就可以得到Combine CPD。使用evidence Observe,就可以观察到变量的分布情况。

  

  function factorList = constructGeneticNetwork(pedigree, alleleFreqs, alphaList)
% pedigree = struct('parents', [0,0;1,3;0,0]);
% pedigree.names = {'Ira','James','Robin'};
% alleleFreqs = [0.1; 0.9];
% alphaList = [0.8; 0.6; 0.1];
% This function constructs a Bayesian network for genetic inheritance. It
% assumes that there are only 2 phenotypes. It also assumes that either
% both parents are specified or neither parent is specified.
%
% In Matlab, each variable will have a number. We need a consistent way of
% numbering variables when instantiating CPDs so that we know what
% variables are involved in each CPD. For example, if IraGenotype is in
% multiple CPDs and IraGenotype is number variable 1 in a CPD, then
% IraGenotype should be numbered as variable 1 in all CPDs and no other
% variables should be numbered 1; thus, every time variable 1 appears in
% our network, we will know that it refers to Ira's genotype.
%
% Here is how the variables should be numbered, for a pedigree with n
% people:
%
% 1. The first n variables should be the genotype variables and should
% be numbered according to the index of the corresponding person in
% pedigree.names; the ith person with name pedigree.names{i} has genotype
% variable number i.
%
% 2. The next n variables should be the phenotype variables and should be
% numbered according to the index of the corresponding person in
% pedigree.names; the ith person with name pedigree.names{i} has phenotype
% variable number n+i.
%
% Here is an example of how the variable numbering should work: if
% pedigree.names = {'Ira', 'James', 'Robin'} and pedigree.parents = [0, 0;
% 1, 3; 0, 0], then the variable numbering is as follows:
%
% Variable 1: IraGenotype
% Variable 2: JamesGenotype
% Variable 3: RobinGenotype
% Variable 4: IraPhenotype
% Variable 5: JamesPhenotype
% Variable 6: RobinPhenotype
%
% Input:
% pedigree: Data structure that includes names, genders, and parent-child
% relationships
% alleleFreqs: Frequencies of alleles in the population
% alphaList: m x 1 vector of alpha values for genotypes, where m is the
% number of genotypes -- the alpha value for a genotype is the
% probability that a person with that genotype will have the
% physical trait
%
% Output:
% factorList: Struct array of factors for the genetic network (In each
% factor, .var, .card, and .val are all row 1-D vectors.) numPeople = length(pedigree.names); % Initialize factors
% The number of factors is twice the number of people because there is a
% factor for each person's genotype and a separate factor for each person's
% phenotype. Note that the order of the factors in the list does not
% matter.
factorList(2*numPeople) = struct('var', [], 'card', [], 'val', []); numAlleles = length(alleleFreqs); % Number of alleles %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%INSERT YOUR CODE HERE
% Variable numbers:
% 1 - numPeople: genotype variables
% numPeople+1 - 2*numPeople: phenotype variables
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% for i = 1 : 2*numPeople
if i <= numPeople
pedigree_parents = pedigree.parents(i,:);
if max(pedigree_parents) == 0
tmp_ = genotypeGivenAlleleFreqsFactor(alleleFreqs,i);
else
tmp_ = genotypeGivenParentsGenotypesFactor(length(alleleFreqs),i,pedigree_parents(1),pedigree_parents(2));
end
else
tmp_ = phenotypeGivenGenotypeFactor(alphaList,i-numPeople,i);
end
factorList(i) = tmp_;
end
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

  在完成这个构造规律函数之后,Koller 教授提供了一种更优雅可可视化方法,可以将这些CPD发送到SamIam里。

机器学习 —— 概率图模型(Homework: StructuredCPD)

2.2 解耦网络的Graph

  解耦网络的好处是可以应对某个碱基有多种取值的情况。但是代价是解耦网络的结构和普通网络完全不同。

 function factorList = constructDecoupledGeneticNetwork(pedigree, alleleFreqs, alphaList)

 % pedigree = struct('parents', [0,0;1,3;0,0]);
% pedigree.names = {'Ira','James','Robin'};
% alleleFreqs = [0.1; 0.7; 0.2];
% alleleListThree = {'F', 'f', 'n'};
% alphaList = [0.8; 0.6; 0.1; 0.5; 0.05; 0.01]; % This function constructs a Bayesian network for genetic inheritance. It
% assumes that there are only 2 phenotypes. It also assumes that, in the
% pedigree, either both parents are specified or neither parent is
% specified.
%
% In Matlab, each variable will have a number. We need a consistent way of
% numbering variables when instantiating CPDs so that we know what
% variables are involved in each CPD. For example, if IraGenotype is in
% multiple CPDs and IraGenotype is number variable 1 in a CPD, then
% IraGenotype should be numbered as variable 1 in all CPDs and no other
% variables should be numbered 1; thus, every time variable 1 appears in
% our network, we will know that it refers to Ira's genotype.
%
% Here is how the variables should be numbered, for a pedigree with n
% people:
%
% 1. The first n variables should be the gene copy 1 variables and should
% be numbered according to the index of the corresponding person in
% pedigree.names; the ith person with name pedigree.names{i} has gene copy
% 1 variable number i. If the parents are specified, then gene copy 1 is the
% copy inherited from pedigree.names{pedigree.parents(i, 1)}.
%
% 2. The next n variables should be the gene copy 2 variables and should
% be numbered according to the index of the corresponding person in
% pedigree.names; the ith person with name pedigree.names{i} has gene copy
% 2 variable number n+i. If the parents are specified, then gene copy 2 is the
% copy inherited from pedigree.parents(i, 2).
%
% 3. The final n variables should be the phenotype variables and should be
% numbered according to the index of the corresponding person in
% pedigree.names; the ith person with name pedigree.names{i} has phenotype
% variable number 2n+i.
%
% Here is an example of how the variable numbering should work: if
% pedigree.names = {'Ira', 'James', 'Robin'} and pedigree.parents = [0, 0;
% 1, 3; 0, 0], then the variable numbering is as follows:
%
% Variable 1: IraGeneCopy1
% Variable 2: JamesGeneCopy1
% Variable 3: RobinGeneCopy1
% Variable 4: IraGeneCopy2
% Variable 5: JamesGeneCopy2
% Variable 6: RobinGeneCopy2
% Variable 7: IraPhenotype
% Variable 8: JamesPhenotype
% Variable 9: RobinPhenotype
%
% Input:
% pedigree: Data structure that includes the names and parents of each
% person
% alleleFreqs: n x 1 vector of allele frequencies in the population,
% where n is the number of alleles
% alphaList: m x 1 vector of alphas for different genotypes, where m is
% the number of genotypes -- the alpha value for a genotype is the
% probability that a person with that genotype will have the physical
% trait
%
% Output:
% factorList: struct array of factors for the genetic network (In each
% factor, .var, .card, and .val are row 1-D vectors.) numPeople = length(pedigree.names); % Initialize factors
% We Need 3*numPeople factors because, for each person, there is a factor
% for the CPD at each of 2 parental copies of the gene and a factor for the
% CPD at the phenotype Note that the order of the factors in the list does
% not matter.
factorList(3*numPeople) = struct('var', [], 'card', [], 'val', []); % Initialize factors
factorList(3*numPeople) = struct('var', [], 'card', [], 'val', []); numAlleles = length(alleleFreqs); % Number of alleles %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% INSERT YOUR CODE HERE
% Variable numbers:
% 1 - numPeople: first parent copy of gene variables
% numPeople+1 - 2*numPeople: second parent copy of gene variables
% 2*numPeople+1 - 3*numPeople: phenotype variables
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% for k = 1:3
switch k
case 1
for i = 1:numPeople
pedigree_parents = pedigree.parents(i,:);
if max(pedigree_parents)==0
factorList(i) = childCopyGivenFreqsFactor(alleleFreqs,i);
else
factorList(i) = childCopyGivenParentalsFactor(numAlleles,i,pedigree_parents(1),pedigree_parents(2));
end
end
case 2
for i = (numPeople+1):2*numPeople
pedigree_parents = pedigree.parents(i-numPeople,:);
if max(pedigree_parents)==0
factorList(i) = childCopyGivenFreqsFactor(alleleFreqs,i);
else
factorList(i) = childCopyGivenParentalsFactor(numAlleles,i,pedigree_parents(1),pedigree_parents(2));
end
end
case 3
for i = 2*numPeople+1:3*numPeople
factorList(i) = phenotypeGivenCopiesFactor(alphaList,numAlleles,i,i-numPeople,i-2*numPeople);
end
end
end %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

  同样,此网络也可以可视化。不赘述。

3、多父节点CPD构造

  对于父节点较少的情况,我们可以不同的程序生成CPD。但是现实生活中,某个变量往往由多个变量决定。这是如果要一项一项指定CPD表格会是一件非常麻烦的事情。不如直接指定父节点与子节点连接权重,然后后由indicate函数点亮连接。最后再由sigmoid函数生成变量之间的关系。这是一种常见的方法。

  此实例是某种表现型由多个碱基(>4)的取值决定。每种碱基影响不同。故此CPD至少有5个entry(表现型+4个碱基对),假设每个entry两种取值,则有32种组合。

  function phenotypeFactor = constructSigmoidPhenotypeFactor(alleleWeights, geneCopyVarOneList, geneCopyVarTwoList, phenotypeVar)

 % alleleWeights = {[3, -3], [0.9, -0.8]};
% geneCopyVarOneList = [1; 2];
% geneCopyVarTwoList = [4; 5];
% phenotypeVar = 3; % This function takes a cell array of alleles' weights and constructs a
% factor expressing a sigmoid CPD.
%
% You can assume that there are only 2 genes involved in the CPD.
%
% In the factor, for each gene, each allele assignment maps to the allele
% whose weight is at the corresponding location. For example, for gene 1,
% allele assignment 1 maps to the allele whose weight is at
% alleleWeights{1}(1) (same as w_1^1), allele assignment 2 maps to the
% allele whose weight is at alleleWeights{1}(2) (same as w_2^1),....
%
% You may assume that there are 2 possible phenotypes.
% For the phenotypes, assignment 1 maps to having the physical trait, and
% assignment 2 maps to not having the physical trait.
%
% THE VARIABLE TO THE LEFT OF THE CONDITIONING BAR MUST BE THE FIRST
% VARIABLE IN THE .var FIELD FOR GRADING PURPOSES
%
% Input:
% alleleWeights: Cell array of weights, where each entry is an 1 x n
% of weights for the alleles for a gene (n is the number of alleles for
% the gene)
% geneCopyVarOneList: m x 1 vector (m is the number of genes) of variable
% numbers that are the variable numbers for each of the first parent's
% copy of each gene (numbers in this list go in the .var part of the
% factor)
% geneCopyVarTwoList: m x 1 vector (m is the number of genes) of variable
% numbers that are the variable numbers for each of the second parent's
% copy of each gene (numbers in this list go in the .var part of the
% factor) -- Note that both copies of each gene are from the same person,
% but each copy originally came from a different parent
% phenotypeVar: Variable number corresponding to the variable for the
% phenotype (goes in the .var part of the factor)
%
% Output:
% phenotypeFactor: Factor in which the values are the probabilities of
% having each phenotype for each allele combination (note that this is
% the FULL CPD with no evidence observed) phenotypeFactor = struct('var', [], 'card', [], 'val', []); %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% INSERT YOUR CODE HERE
% Note that computeSigmoid.m will be useful for this function.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% % Fill in phenotypeFactor.var. This should be a 1-D row vector.
% Fill in phenotypeFactor.card. This should be a 1-D row vector.
phenotypeFactor.var = [phenotypeVar geneCopyVarOneList' geneCopyVarTwoList'];
%num_of_alle_in_each_geno
geno_var_card = []; for i=1:length(geneCopyVarOneList)
geno_var_card = [geno_var_card max(size(alleleWeights{i}))];
end
phenotypeFactor.card =[2 geno_var_card geno_var_card];
phenotypeFactor.val = zeros(1, prod(phenotypeFactor.card));
% Replace the zeros in phentoypeFactor.val with the correct values.
assign_allele = IndexToAssignment(1:prod(phenotypeFactor.card),phenotypeFactor.card); for i = 1:prod(phenotypeFactor.card)
pheno_allele_ = assign_allele(i,:);
% start from the second var
k = 2;
weight = 0;
% a pair of gene come from *two* people
for j = 1:2
% get num of allele from one people
for l = 1:length(geneCopyVarOneList)
weight = weight+alleleWeights{l}(pheno_allele_(k));
k = k+1;
end
end
switch pheno_allele_(1)
case 1
phenotypeFactor.val(i) = computeSigmoid(weight);
case 2
phenotypeFactor.val(i) = 1-computeSigmoid(weight);
end
end
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

  所有代码点击这里 

上一篇:机器学习 —— 概率图模型(Homework: Factors)


下一篇:exgcd