API Documentation

class topsbm.TopSBM(n_init=1, min_groups=None, max_groups=None, weighted_edges=True, random_state=None)

A Scikit-learn compatible transformer for hSBM topic models

Parameters:

n_init : int, default=1

Number of random initialisations to perform, to reduce the risk of settling in a local minimum of the minimum description length (MDL). The solution with the lowest MDL is kept.

min_groups : int, default=None

The minimum number of word and document groups to infer. This is also a lower bound on the number of topics.

max_groups : int, default=None

The maximum number of word and document groups to infer. This is also an upper bound on the number of topics.

weighted_edges : bool, default=True

When True, edges are weighted instead of adding duplicate edges.

random_state : None, int or np.random.RandomState

Controls randomization. See Scikit-learn’s glossary.

Note that if this is set, the global random state of libcore will be affected, and the global random state of numpy will be temporarily affected.
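As a minimal sketch, the constructor parameters above can be passed as follows. The toy count matrix and the chosen values (`n_init=3`, `random_state=0`) are illustrative assumptions, and the import is guarded because topsbm depends on graph-tool, which may not be installed.

```python
import numpy as np

# Hypothetical toy document-term counts: 4 documents (rows), 5 words (columns).
X = np.array([
    [2, 1, 0, 0, 0],
    [1, 3, 0, 0, 1],
    [0, 0, 2, 2, 0],
    [0, 0, 1, 3, 1],
])

try:
    from topsbm import TopSBM
except ImportError:
    TopSBM = None  # topsbm (or its graph-tool dependency) is not available

model = None
if TopSBM is not None:
    # Several initialisations; the lowest-MDL solution is the one kept.
    model = TopSBM(
        n_init=3,
        min_groups=None,
        max_groups=None,
        weighted_edges=True,
        random_state=0,  # per the note above, this also touches global random state
    )
    print(model.get_params())
```

Fixing `random_state` is mainly useful for reproducible experiments; leaving it as `None` avoids the side effects on global random state described above.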

References

Martin Gerlach, Tiago P. Peixoto, and Eduardo G. Altmann, “A network approach to topic models,” Science Advances (2018).

Attributes:

graph_ : graph_tool.Graph

Bipartite graph between samples (the first n_samples_ vertices) and features (the remaining vertices)

state_

Inference state from graph-tool.

n_levels_ : int

The number of levels in the inferred hierarchy of groups.

groups_ : dict

Results of group membership from inference. Key is an integer, indicating the level of grouping (starting from 0). Value is a dict of information about the grouping which contains:

B_d : int: number of doc-groups
B_w : int: number of word-groups
p_tw_d : array of shape (B_w, n_samples): doc-topic mixtures, the probability of word-group tw in document d, P(tw | d)
p_td_d : array of shape (B_d, n_samples): doc-group membership, the probability that doc-node d belongs to doc-group td, P(td | d)
p_tw_w : array of shape (B_w, n_features): word-group membership, the probability that word-node w belongs to word-group tw, P(tw | w)
p_w_tw : array of shape (n_features, B_w): topic distribution, the probability of word w given topic tw, P(w | tw)

Here “d”/document refers to samples; “w”/word refers to features.
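The layout of one entry of groups_ can be illustrated with a stand-in dictionary. The sketch below does not call topsbm: it fills a dict with random arrays of the documented shapes (assuming the document axis of p_tw_d has length n_samples) and shows how the arrays would typically be read. The sizes, the vocabulary, and the helper `column_stochastic` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_features, B_d, B_w = 4, 6, 2, 3  # hypothetical sizes


def column_stochastic(shape):
    """Random non-negative array whose columns each sum to 1."""
    a = rng.random(shape)
    return a / a.sum(axis=0, keepdims=True)


# Hypothetical stand-in for model.groups_[0]; real values come from inference.
level0 = {
    "B_d": B_d,
    "B_w": B_w,
    "p_tw_d": column_stochastic((B_w, n_samples)),   # P(tw | d)
    "p_td_d": column_stochastic((B_d, n_samples)),   # P(td | d)
    "p_tw_w": column_stochastic((B_w, n_features)),  # P(tw | w)
    "p_w_tw": column_stochastic((n_features, B_w)),  # P(w | tw)
}

# Hypothetical vocabulary aligned with the feature (column) order of X.
vocabulary = np.array(["alpha", "beta", "gamma", "delta", "epsilon", "zeta"])

# Most probable words for each word-group (topic), read from P(w | tw).
top_k = 2
for tw in range(level0["B_w"]):
    order = np.argsort(level0["p_w_tw"][:, tw])[::-1][:top_k]
    print(tw, vocabulary[order].tolist())

# Most probable word-group for each document, read from P(tw | d).
dominant_topic = level0["p_tw_d"].argmax(axis=0)
print(dominant_topic)
```

Each array is a conditional distribution over its first axis, so in this stand-in every column sums to 1.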

mdl_

Minimum description length of the inferred state.

n_features_ : int

n_samples_ : int

Methods

`fit`(X[, y])	Fit the hSBM topic model
`fit_transform`(X[, y])	Fit the hSBM topic model
`get_params`([deep])	Get parameters for this estimator.
`plot_graph`([filename, n_edges])	Plots arcs from documents to words coloured by inferred group
`set_params`(**params)	Set the parameters of this estimator.

fit(X, y=None)

Fit the hSBM topic model

Constructs a graph representation of X and infers clustering.

Parameters:

X : ndarray or sparse matrix of shape (n_samples, n_features)

Word frequencies for each document, represented as non-negative integers.

y : ignored

Returns:

self

fit_transform(X, y=None)

Fit the hSBM topic model

Constructs a graph representation of X, infers clustering, and reports the cluster probability for each sample in X.

Parameters:

X : ndarray or sparse matrix of shape (n_samples, n_features)

Word frequencies for each document, represented as non-negative integers.

y : ignored

Returns:

Xt : ndarray of shape (n_samples, n_components)

The cluster probability for each sample in X.
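A hedged usage sketch for fit_transform follows. The block-structured count matrix is synthetic, hypothetical data, and the fitting step runs only if topsbm and graph-tool are importable. The behaviour on such a small input has not been verified here, so treat it as a starting point, not a reference result.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic counts (hypothetical): 20 documents, 12 words, two disjoint blocks,
# so every document and every word has at least one non-zero count.
X = np.zeros((20, 12), dtype=int)
X[:10, :6] = rng.integers(1, 5, size=(10, 6))
X[10:, 6:] = rng.integers(1, 5, size=(10, 6))

try:
    from topsbm import TopSBM
except ImportError:
    TopSBM = None  # topsbm (or its graph-tool dependency) is not available

if TopSBM is not None:
    model = TopSBM(random_state=0)
    Xt = model.fit_transform(X)  # cluster probabilities, one row per document
    print(Xt.shape)              # (n_samples, n_components)
    print(model.n_levels_)       # depth of the inferred hierarchy
    print(model.mdl_)            # description length of the chosen state
```

Because the class is Scikit-learn compatible, the same call pattern should also work as a step after a count vectoriser in a pipeline, provided the input stays a matrix of non-negative integer counts.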

plot_graph(filename=None, n_edges=1000)

Plots arcs from documents to words coloured by inferred group

Parameters:

filename : str, optional

Path to write to (e.g. ‘something.png’). Otherwise returns a displayable object.

n_edges : int

Size of subsample to plot (reducing memory requirements).