
'nan' output for CoherenceModel when calculating 'c_v' #3040

Open
@job-almekinders

Description


Problem description

I'm using LdaMulticore from gensim 3.8.3. After training on my training corpus, I can evaluate that same corpus with gensim's CoherenceModel and compute the 'c_v' value. However, when I try to compute 'c_v' over my test set, the following warnings are raised:

/Users/xxx/env/lib/python3.7/site-packages/gensim/topic_coherence/direct_confirmation_measure.py:204: RuntimeWarning: divide by zero encountered in double_scalars
m_lr_i = np.log(numerator / denominator)
/Users/xxx/lib/python3.7/site-packages/gensim/topic_coherence/indirect_confirmation_measure.py:323: RuntimeWarning: invalid value encountered in double_scalars
return cv1.T.dot(cv2)[0, 0] / (_magnitude(cv1) * _magnitude(cv2))

Furthermore, the CoherenceModel returns 'nan' for some of the topics, so I'm unable to evaluate my model on a held-out test set.
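For context, here is a small standalone snippet (plain NumPy, not gensim code) illustrating how the two warnings above turn into inf and nan values once a probability in the denominator is zero:

import numpy as np

# Illustration only: a zero denominator (e.g. a word that never occurs in the
# evaluation windows) triggers the divide-by-zero warning and yields inf;
# later arithmetic that mixes inf values then produces nan.
numerator = np.float64(0.5)
denominator = np.float64(0.0)
m_lr_i = np.log(numerator / denominator)  # RuntimeWarning: divide by zero -> inf
print(m_lr_i - m_lr_i)                    # inf - inf -> nan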

Steps/code/corpus to reproduce

I run the following code:

from gensim import models

coherence_model_lda = models.CoherenceModel(model=lda_model,
                                            topics=topic_list,
                                            corpus=corpus,
                                            texts=texts,
                                            dictionary=train_dictionary,
                                            coherence='c_v',
                                            topn=20)

coherence_model_lda.get_coherence()
# -> nan  (aggregated c_v value)

coherence_model_lda.get_coherence_per_topic()
# -> [0.4855137269180713, 0.3718866594914528, nan, nan, nan, 0.6782845928414825, 0.21638660621444444, 0.22337594485796397, 0.5975773184175942, 0.721341268732559, 0.5299883104816663, 0.5057903454344682, 0.5818051100304473, nan, nan, 0.30613393712342557, nan, 0.4104488627000527, nan, nan, 0.46028708148750963, nan, 0.394606654755219, 0.520685457293826, 0.5918440959767729, nan, nan, 0.4842068862650447, 0.9350644411891258, nan, nan, 0.7471151926054456, nan, nan, 0.5084926961568169, nan, nan, 0.4322957454944861, nan, nan, nan, 0.6460815758337844, 0.5810936860540964, 0.6636319471764807, nan, 0.6129884526648472, 0.48915614063099017, 0.4746167359622748, nan, 0.6826979166639224]  (coherence value per topic)
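A quick way to quantify how many topics are affected, using NumPy on the per-topic output above:

import numpy as np

per_topic = np.array(coherence_model_lda.get_coherence_per_topic())
print("topics with nan coherence:", int(np.isnan(per_topic).sum()))
print("mean over the finite topics:", np.nanmean(per_topic))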

I've tried increasing the EPSILON value in gensim.topic_coherence.direct_confirmation_measure, but this has no effect.
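A sketch of the kind of change I mean, monkey-patching the module-level constant before building the CoherenceModel (the replacement value is arbitrary):

from gensim.topic_coherence import direct_confirmation_measure

# Increase the smoothing constant used by the log-ratio measure
# (1e-12 by default in gensim 3.8.3); this did not remove the nan values.
direct_confirmation_measure.EPSILON = 1e-6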

I've also tried changing the input arguments (e.g. excluding the dictionary argument), but that has no effect either. I suspect the problem is that a large portion of the words in the test set does not appear in the training set; however, the EPSILON value should be able to handle this.
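A minimal sketch of a possible workaround along those lines, filtering out-of-vocabulary tokens from the test texts before building the CoherenceModel (the test_texts name is an assumption for illustration):

# Possible workaround sketch: drop test-set tokens that are missing from the
# training dictionary, so every topic word keeps a non-zero count.
train_vocab = set(train_dictionary.token2id)
filtered_test_texts = [
    [token for token in doc if token in train_vocab]
    for doc in test_texts
]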

Versions

python
Python 3.7.2 (default, Dec  2 2020, 09:47:26) 
[Clang 9.0.0 (clang-900.0.39.2)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import platform; print(platform.platform())
Darwin-18.7.0-x86_64-i386-64bit
>>> import sys; print("Python", sys.version)
Python 3.7.2 (default, Dec  2 2020, 09:47:26) 
[Clang 9.0.0 (clang-900.0.39.2)]
>>> import struct; print("Bits", 8 * struct.calcsize("P"))
Bits 64
>>> import numpy; print("NumPy", numpy.__version__)
NumPy 1.18.5
>>> import scipy; print("SciPy", scipy.__version__)
SciPy 1.5.2
>>> import gensim; print("gensim", gensim.__version__)
gensim 3.8.3
>>> from gensim.models import word2vec;print("FAST_VERSION", word2vec.FAST_VERSION)

Labels

    bug (Issue described a bug)
    difficulty medium (Medium issue: required good gensim understanding & python skills)
    impact HIGH (Show-stopper for affected users)
    reach LOW (Affects only niche use-case users)
