Hi Vikas -- the optimum number of topics (K in LDA) is dependent on a at least two factors:
Firstly, your data set may have an intrinsic number of topics, i.e., may derive
from some natural clusters that your data have. This number will in the best
case make your ppx minimal. A non-parametric approach like HDP would ideally
result in the same K as the one that minimises ppx for LDA. The second type of
influence is that of the hyperparameters. If you fix the Dirichlet parameters
alpha and beta (for LDA's Dirichlet-multinomial "levels" (theta | alpha) and
(phi | beta)), you bias the optimum K. For instance, larger alpha will force
more " "decisive" choices of z for each token, leading to a concentration of
theta to fewer weights, which influences K.
Trouble minimizing perplexity in LDA
I am running LDA from Mark Steyver's MATLAB Topic Modelling toolkit on a few Apache Java open source projects. I have taken care of stop word removal (for e.g. words such Apache, java keywords are marked as stopwords) and tokenization. I find that perplexity on test data always decreases with increasing number of topics. I tried different values of I need to find optimal number of topics and for that perplexity plot should reach a minimum. Please suggest what may be wrong. Definition and details regarding calculation of perplexity of a topic model is explained in this post. Edit: I played with hyperparameters alpha and beta and now perplexity seems to reach a minimum. It is not clear to me as to how these hyperparameters affect perplexity. Initially I was plotting results till 200 topics without any success. Now on the same range minimum is reached at around 50-60 topics (which was my intuition) after modifying hyperparameters. Also, as this postnotes, you bias optimal number of topics according to specific values of hyperparameters. |
|||||||||||||||||
|
You might want to have a look at the implementation of LDA in Mallet, which can do hyperparameter optimization as part of the training. Mallet also uses asymmetric priors by default, which according to this paper, leads to the model being much more robust against setting the number of topics too high. In practice this means you don't have to specify the hyperparameters, and can set number of topics pretty high without negatively affecting results. In my experience hyperparameter optimization and asymmetric priors gave significantly better topics than without it, but I haven't tried the Matlab Topic Modelling toolkit. |