API Documentation

class topsbm.TopSBM(n_init=1, min_groups=None, max_groups=None, weighted_edges=True, random_state=None)[source]

A Scikit-learn compatible transformer for hSBM topic models

Parameters:
n_init : int, default=1

Number of random initialisations to perform, reducing the risk of settling in a local minimum of the description length (MDL). The solution with the lowest MDL is kept.

min_groups : int, default=None

The minimum number of word and document groups to infer. This is also a lower bound on the number of topics.

max_groups : int, default=None

The maximum number of word and document groups to infer. This is also an upper bound on the number of topics.

weighted_edges : bool, default=True

When True, repeated occurrences of a word in a document are represented by a single weighted edge instead of duplicate edges.

random_state : None, int or np.random.RandomState

Controls randomization. See Scikit-learn’s glossary.

Note that if this is set, the global random state of libcore will be affected, and the global random state of numpy will be temporarily affected.

References

Martin Gerlach, Tiago P. Peixoto, and Eduardo G. Altmann, “A network approach to topic models,” Science Advances (2018).
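The parameters above can be exercised with a small sketch like the following. It is illustrative only: `count_matrix` and `fit_topsbm` are hypothetical helpers (not part of topsbm), whitespace tokenisation stands in for a real vectoriser, and `topsbm` is imported lazily because it needs graph_tool to be installed.

```python
from collections import Counter


def count_matrix(docs):
    """Return (rows, vocabulary); rows[i][j] counts vocabulary[j] in docs[i]."""
    # Naive whitespace tokenisation; a real pipeline would use a proper vectoriser.
    tokenised = [doc.lower().split() for doc in docs]
    vocabulary = sorted({word for tokens in tokenised for word in tokens})
    rows = []
    for tokens in tokenised:
        counts = Counter(tokens)
        rows.append([counts[word] for word in vocabulary])
    return rows, vocabulary


def fit_topsbm(docs, n_init=3, random_state=0):
    """Fit TopSBM on raw documents (requires numpy, topsbm and graph_tool)."""
    import numpy as np
    from topsbm import TopSBM  # lazy import: graph_tool is a heavy dependency

    rows, vocabulary = count_matrix(docs)
    X = np.asarray(rows, dtype=int)  # shape (n_samples, n_features)
    model = TopSBM(n_init=n_init, random_state=random_state)
    Xt = model.fit_transform(X)
    return model, Xt, vocabulary
```

Calling `fit_topsbm(list_of_strings)` would return the fitted estimator, the transformed matrix and the vocabulary. Raising `n_init` trades run time for more chances of finding a lower MDL.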

Attributes:
graph_ : graph_tool.Graph

Bipartite graph between samples (the first n_samples_ vertices) and features (the remaining vertices)

state_

Inference state from graph_tool

n_levels_ : int

The number of levels in the inferred hierarchy of groups.

groups_ : dict

Results of group membership from inference. Key is an integer, indicating the level of grouping (starting from 0). Value is a dict of information about the grouping which contains:

B_d : int

number of doc-groups

B_w : int

number of word-groups

p_tw_d : array of shape (B_w, n_samples)

doc-topic mixtures: prob of word-group tw in doc d P(tw | d)

p_td_d : array of shape (B_d, n_samples)

doc-group membership: prob that doc-node d belongs to doc-group td: P(td | d)

p_tw_w : array of shape (B_w, n_features)

word-group-membership: prob that word-node w belongs to word-group tw: P(tw | w)

p_w_tw : array of shape (n_features, B_w)

topic distribution: prob of word w given topic tw P(w | tw)

Here “d”/document refers to samples; “w”/word refers to features.
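As a rough illustration of how p_w_tw might be read, the sketch below lists the most probable words of each word-group. `top_words` is a hypothetical helper, and it assumes the dictionary keys match the names listed above (for example `groups_[0]["p_w_tw"]`) and that a `vocabulary` list aligned with the feature columns is available.

```python
def top_words(p_w_tw, vocabulary, n_top=3):
    """Return, for each word-group, its n_top most probable words.

    p_w_tw is indexed as p_w_tw[word][topic], i.e. shape (n_features, B_w).
    Works on nested lists and on 2-D numpy arrays alike.
    """
    n_features = len(p_w_tw)
    n_topics = len(p_w_tw[0])
    result = []
    for topic in range(n_topics):
        # Rank word indices by P(w | tw) for this topic, highest first.
        ranked = sorted(
            range(n_features),
            key=lambda word: p_w_tw[word][topic],
            reverse=True,
        )
        result.append([vocabulary[word] for word in ranked[:n_top]])
    return result
```

With a fitted model this might be called as `top_words(model.groups_[0]["p_w_tw"], vocabulary)`; other levels of the hierarchy can be inspected by changing the level key.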

mdl_

minimum description length of inferred state

n_features_ : int
n_samples_ : int

Methods

fit(X[, y]) Fit the hSBM topic model
fit_transform(X[, y]) Fit the hSBM topic model
get_params([deep]) Get parameters for this estimator.
plot_graph([filename, n_edges]) Plots arcs from documents to words coloured by inferred group
set_params(**params) Set the parameters of this estimator.
fit(X, y=None)[source]

Fit the hSBM topic model

Constructs a graph representation of X and infers clustering.

Parameters:
X : ndarray or sparse matrix of shape (n_samples, n_features)

Word frequencies for each document, represented as non-negative integers.

y : ignored
Returns:
self
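Since X is expected to hold non-negative integer counts, a quick check before calling fit can catch accidental TF-IDF or float input. The helper below is a hypothetical sketch for dense, row-iterable input only (nested lists or a 2-D numpy array); it does not handle sparse matrices.

```python
from numbers import Integral


def is_count_matrix(rows):
    """Return True if rows is rectangular and holds only non-negative integers."""
    width = len(rows[0]) if len(rows) else 0
    for row in rows:
        if len(row) != width:
            return False  # ragged input cannot be (n_samples, n_features)
        for value in row:
            # bool is technically an Integral, but is not a word count.
            if isinstance(value, bool) or not isinstance(value, Integral):
                return False
            if value < 0:
                return False
    return True
```

Because fit returns self, a checked matrix can then be passed straight through, for example `model = TopSBM().fit(X)`.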
fit_transform(X, y=None)[source]

Fit the hSBM topic model

Constructs a graph representation of X, infers clustering, and reports the cluster probability for each sample in X.

Parameters:
X : ndarray or sparse matrix of shape (n_samples, n_features)

Word frequencies for each document, represented as non-negative integers.

y : ignored
Returns:
Xt : ndarray of shape (n_samples, n_components)

The cluster probability for each sample in X
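If a single label per document is wanted, the probabilities in Xt can be reduced to the most probable cluster per row. This is a minimal sketch; `hard_assignments` is a hypothetical helper, and ties are resolved in favour of the lowest cluster index.

```python
def hard_assignments(Xt):
    """Return the index of the most probable cluster for each row of Xt.

    Xt is indexed as Xt[sample][component]; nested lists and 2-D numpy
    arrays both work.
    """
    labels = []
    for row in Xt:
        # max() keeps the first index when several probabilities tie.
        best = max(range(len(row)), key=lambda k: row[k])
        labels.append(best)
    return labels
```

For example, `labels = hard_assignments(model.fit_transform(X))` would give one integer label per sample.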

plot_graph(filename=None, n_edges=1000)[source]

Plots arcs from documents to words coloured by inferred group

Parameters:
filename : str, optional

Path to write to (e.g. ‘something.png’). If omitted, a displayable object is returned instead.

n_edges : int

Size of subsample to plot (reducing memory requirements)
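A call to plot_graph on a fitted model might be wrapped as below. `save_topic_graph` is a hypothetical convenience function; it only forwards the two documented arguments and assumes `model` has already been fitted.

```python
def save_topic_graph(model, path="something.png", n_edges=1000):
    """Write the document-word plot of a fitted model to path and return path."""
    # A smaller n_edges subsamples more aggressively, lowering memory use.
    model.plot_graph(filename=path, n_edges=n_edges)
    return path
```

Lowering `n_edges` (for instance to 200) is one way to keep the plot manageable on large corpora.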