Example: Introduction to topsbm

Previous topic

Next topic

This Page

Example: Introduction to topsbm¶

Topic modelling with hierarchical stochastic block models

[1]:

from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd
from topsbm import TopSBM

Setup: Load a corpus¶

We have a list of documents, each document contains a list of words.
We have a list of document titles (optional)

The example corpus consists of 63 articles from Wikipedia taken from 3 different categories (Experimental Physics, Chemical Physics, and Computational Biology).

We use scikit-learn’s CountVectorizer to turn this text into a feature matrix.

[2]:

# Load texts and vectorize
with open('corpus.txt', 'r') as f:
    docs = f.readlines()

vec = CountVectorizer(token_pattern=r'\S+')
X = vec.fit_transform(docs)

# X is now a sparse matrix of (docs, words)

# titles corresponding to docs
with open('titles.txt', 'r') as f:
    x = f.readlines()
titles = [h.split()[0] for h in x]

[3]:

# view the data for document 0
print(titles[0])
print(docs[0][:100])

Nuclear_Overhauser_effect
 the nuclear overhauser effect noe is the transfer of nuclear spin polarization from one nuclear spi

Fit the model¶

Calling TopSBM.fit_transform will: * construct the bipartite graph between documents and words (samples and features) * perform Hierarchical Stochastic Block Model inference over the graph * return an embedding of the samples in the block level with finest granularity

[18]:

model = TopSBM(random_state=9)
Xt = model.fit_transform(X)

Plotting the graph and block structure¶

The following plot shows the (hierarchical) community structure in the word-document network as inferred by the stochastic block model:

document-nodes are on the left
word-nodes are on the right
different colors correspond to the different groups

The result is a grouping of nodes into groups on multiple levels in the hierarchy:

on the uppermost level, each node belongs to the same group (square in the middle)
on the next-lower level, we split the network into two groups: the word-nodes and the document-nodes (blue sqaures to the left and right, respectively). This is a trivial structure due to the bipartite character of the network.
only next lower levels constitute a non-trivial structure: We now further divide nodes into smaller groups (document-nodes into document-groups on the left and word-nodes into word-groups on the right)

In the code, the lowest level is known as level 0, with coarser levels 1, 2, …

[19]:

model.plot_graph(n_edges=1000)

Topics¶

For each word-group on a given level in the hierarchy, we retrieve the $n$ most common words in each group – these are the topics!

[20]:

topics = pd.DataFrame(model.groups_[1]['p_w_tw'],
                      index=vec.get_feature_names())

[21]:

for topic in topics.columns:
    print(topics[topic].nlargest(10))
    print()

the    0.006768
of     0.006661
a      0.006554
in     0.006446
is     0.006446
to     0.006339
and    0.006124
for    0.005372
an     0.005264
as     0.005264
Name: 0, dtype: float64

when       0.008921
where      0.008058
first      0.006619
given      0.006331
if         0.006331
field      0.006043
applied    0.005755
because    0.005755
e          0.005468
energy     0.005468
Name: 1, dtype: float64

computational    0.060606
development      0.055556
proteins         0.045455
open             0.040404
protein          0.040404
software         0.040404
community        0.035354
researchers      0.035354
core             0.025253
identify         0.025253
Name: 2, dtype: float64

point       0.217391
formula     0.188406
must        0.144928
wave        0.115942
spectrum    0.086957
air         0.072464
plane       0.057971
flow        0.043478
q           0.043478
mode        0.028986
Name: 3, dtype: float64

Topic-distribution in each document¶

Which level-1 topics contribute to each document?

[22]:

pd.DataFrame(model.groups_[1]['p_tw_d'],
             columns=titles)

[22]:

	Nuclear_Overhauser_effect	Quantum_solvent	Rovibrational_coupling	Effective_field_theory	Chemical_physics	Rotational_transition	Dynamic_nuclear_polarisation	Knight_shift	Polarizability	Anisotropic_liquid	...	Louis_and_Beatrice_Laufer_Center_for_Physical_and_Quantitative_Biology	Law_of_Maximum	Enzyme_Function_Initiative	SnoRNA_prediction_software	Sepp_Hochreiter	Aureus_Sciences	IEEE/ACM_Transactions_on_Computational_Biology_and_Bioinformatics	Knotted_protein	BioUML	De_novo_transcriptome_assembly
0	0.608392	0.856	0.523529	0.804651	0.82	0.648649	0.584192	0.582418	0.493274	0.645161	...	0.907692	0.851351	0.857143	0.857143	0.846690	0.822222	0.84375	0.773585	0.870647	0.868932
1	0.391608	0.144	0.458824	0.190698	0.16	0.337838	0.412371	0.406593	0.500000	0.354839	...	0.092308	0.121622	0.095238	0.142857	0.139373	0.133333	0.09375	0.169811	0.084577	0.092233
2	0.000000	0.000	0.000000	0.000000	0.00	0.000000	0.000000	0.010989	0.002242	0.000000	...	0.000000	0.027027	0.044218	0.000000	0.013937	0.044444	0.06250	0.047170	0.044776	0.038835
3	0.000000	0.000	0.017647	0.004651	0.02	0.013514	0.003436	0.000000	0.004484	0.000000	...	0.000000	0.000000	0.003401	0.000000	0.000000	0.000000	0.00000	0.009434	0.000000	0.000000

4 rows × 63 columns

Extra: Clustering of documents - for free.¶

The stochastic block models clusters the documents into groups. We do not need to run an additional clustering to obtain this grouping.

For a query article, we can return all articles from the same group

[23]:

cluster_labels = pd.DataFrame(model.groups_[1]['p_td_d'],
                              columns=titles).idxmax(axis=0)
cluster_idx = cluster_labels['Rovibrational_coupling']
cluster_labels[cluster_labels == cluster_idx]

[23]:

Nuclear_Overhauser_effect                        0
Rovibrational_coupling                           0
Rotational_transition                            0
Dynamic_nuclear_polarisation                     0
Knight_shift                                     0
Polarizability                                   0
Anisotropic_liquid                               0
Rotating_wave_approximation                      0
Molecular_vibration                              0
Fuel_mass_fraction                               0
Electrostatic_deflection_(structural_element)    0
Magic_angle_(EELS)                               0
Reactive_empirical_bond_order                    0
Photofragment-ion_imaging                        0
Molecular_beam                                   0
McConnell_equation                               0
Ziff-Gulari-Barshad_model                        0
Empirical_formula                                0
Newton's_laws_of_motion                          0
Ripple_tank                                      0
Particle-induced_X-ray_emission                  0
Elevator_paradox_(physics)                       0
Wave_tank                                        0
X-ray_crystal_truncation_rod                     0
Faraday_cup_electrometer                         0
Line_source                                      0
X-ray_standing_waves                             0
Point_source                                     0
Fragment_separator                               0
Dynamic_mode_decomposition                       0
Euler's_laws_of_motion                           0
Quantum_oscillations_(experimental_technique)    0
dtype: int64

Table of Contents