Description
Problem description
I'm using LdaMulticore from gensim 3.8.3. I train the model on my training corpus and can evaluate that corpus with gensim's CoherenceModel to compute the 'c_v' score. However, when I try to compute 'c_v' over my test set, the following warnings are raised:
/Users/xxx/env/lib/python3.7/site-packages/gensim/topic_coherence/direct_confirmation_measure.py:204: RuntimeWarning: divide by zero encountered in double_scalars
m_lr_i = np.log(numerator / denominator)
/Users/xxx/lib/python3.7/site-packages/gensim/topic_coherence/indirect_confirmation_measure.py:323: RuntimeWarning: invalid value encountered in double_scalars
return cv1.T.dot(cv2)[0, 0] / (_magnitude(cv1) * _magnitude(cv2))
Furthermore, the coherence values returned by the CoherenceModel are 'nan' for some of the topics, so I'm not able to evaluate my model on a held-out test set.
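For context, here is how I read the failure (a sketch based on the two quoted lines, with assumed numbers; the exact placement of EPSILON is my assumption rather than a confirmed fact): if a topic word never occurs in the reference texts, the denominator of the log-ratio measure becomes 0.0, the division overflows to inf, and the cosine in the indirect confirmation step then evaluates inf / inf, i.e. nan.

import numpy as np

# Assumed reading of the failure mode behind the two warnings: EPSILON only pads
# the co-occurrence probability, so a topic word with zero document frequency in
# the reference (test) texts leaves the denominator at 0.0.
EPSILON = 1e-12                                   # smoothing constant (assumed value)
co_occur_prob = np.float64(0.0)                   # word pair never co-occurs in the test texts
p_w1, p_w2 = np.float64(0.0), np.float64(0.02)    # first word is absent from the test texts

numerator = co_occur_prob + EPSILON
denominator = p_w1 * p_w2                         # 0.0
m_lr = np.log(numerator / denominator)            # RuntimeWarning: divide by zero -> inf

# The inf entries propagate into the context vectors, and the cosine similarity
# in indirect_confirmation_measure then evaluates inf / inf -> nan.
cv1 = np.array([[m_lr], [1.0]])
cv2 = np.array([[m_lr], [2.0]])
cosine = cv1.T.dot(cv2)[0, 0] / (np.linalg.norm(cv1) * np.linalg.norm(cv2))
print(m_lr, cosine)                               # inf nan (RuntimeWarning: invalid value)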
Steps/code/corpus to reproduce
I run the following code:
from gensim import models

coherence_model_lda = models.CoherenceModel(model=lda_model,
                                            topics=topic_list,
                                            corpus=corpus,
                                            texts=texts,
                                            dictionary=train_dictionary,
                                            coherence='c_v',
                                            topn=20)
coherence_model_lda.get_coherence()
# output (aggregated c_v value): nan
coherence_model_lda.get_coherence_per_topic()
# output (c_v value per topic): [0.4855137269180713, 0.3718866594914528, nan, nan, nan, 0.6782845928414825, 0.21638660621444444, 0.22337594485796397, 0.5975773184175942, 0.721341268732559, 0.5299883104816663, 0.5057903454344682, 0.5818051100304473, nan, nan, 0.30613393712342557, nan, 0.4104488627000527, nan, nan, 0.46028708148750963, nan, 0.394606654755219, 0.520685457293826, 0.5918440959767729, nan, nan, 0.4842068862650447, 0.9350644411891258, nan, nan, 0.7471151926054456, nan, nan, 0.5084926961568169, nan, nan, 0.4322957454944861, nan, nan, nan, 0.6460815758337844, 0.5810936860540964, 0.6636319471764807, nan, 0.6129884526648472, 0.48915614063099017, 0.4746167359622748, nan, 0.6826979166639224]
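A possible stopgap (just a sketch, and it silently ignores the affected topics rather than fixing them) is to aggregate only the finite per-topic scores:

import numpy as np

# Average only the finite per-topic c_v values and report how many topics scored nan.
per_topic = coherence_model_lda.get_coherence_per_topic()
finite_scores = [s for s in per_topic if np.isfinite(s)]
print(f"{len(per_topic) - len(finite_scores)} of {len(per_topic)} topics scored nan")
print("mean c_v over finite topics:", np.mean(finite_scores))  # same as np.nanmean(per_topic)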
I've tried increasing the EPSILON value in gensim.topic_coherence.direct_confirmation_measure, but this has no effect.
I've also tried changing the input arguments (e.g. excluding the dictionary argument), again without effect. I think the error has something to do with the fact that quite a large portion of the words in the test set is not present in the train set; however, I would expect the EPSILON value to handle this.
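To illustrate the vocabulary mismatch I suspect, here is a small diagnostic sketch; it assumes topic_list holds each topic's top words as strings and texts holds the tokenized test documents, as in the reproduction code above:

# Count how many of each topic's top words never occur in the tokenized test
# documents; those topics are the ones I would expect to score nan.
test_vocab = {word for doc in texts for word in doc}

for topic_id, top_words in enumerate(topic_list):
    missing = [w for w in top_words if w not in test_vocab]
    if missing:
        print(f"topic {topic_id}: {len(missing)}/{len(top_words)} top words "
              f"missing from the test texts, e.g. {missing[:3]}")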
Versions
Python 3.7.2 (default, Dec 2 2020, 09:47:26)
[Clang 9.0.0 (clang-900.0.39.2)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import platform; print(platform.platform())
Darwin-18.7.0-x86_64-i386-64bit
>>> import sys; print("Python", sys.version)
Python 3.7.2 (default, Dec 2 2020, 09:47:26)
[Clang 9.0.0 (clang-900.0.39.2)]
>>> import struct; print("Bits", 8 * struct.calcsize("P"))
Bits 64
>>> import numpy; print("NumPy", numpy.__version__)
NumPy 1.18.5
>>> import scipy; print("SciPy", scipy.__version__)
SciPy 1.5.2
>>> import gensim; print("gensim", gensim.__version__)
gensim 3.8.3
>>> from gensim.models import word2vec;print("FAST_VERSION", word2vec.FAST_VERSION)