ark.spLDA

ark.spLDA.processing

ark.spLDA.processing.compute_topic_eda(features, featurization, topics, silhouette=False, num_boots=None)

Computes several metrics for k-means clustering models to help determine an appropriate number of topics for use in spatial-LDA analysis.

The metrics are:

  • Inertia: the total sum of within-cluster variance for all clusters.

  • Silhouette Score: a goodness-of-fit measurement for clustering. Values closer to 1 indicate that most observations are well matched to their cluster, while values closer to -1 indicate poorly matched observations.

  • Gap Statistic: \(Gap(k)\) is a resampling-based measure, computed as the difference between the expected value \(E(\log W_k)\) of the log of the pooled within-cluster sum of squares under a null distribution and its observed value \(\log W_k\). The optimal number of clusters is the smallest \(k\) for which \(Gap(k) > Gap(k+1) - s_{k+1}\), where \(s_{k+1}\) is a scaled estimate of the standard error of \(Gap(k+1)\).

  • Cell Count: the distribution of cell features within each cluster.

Parameters:
  • features (pandas.DataFrame) – A DataFrame of featurized cellular neighborhoods. Specifically, this is one of the outputs of featurize_cell_table().

  • featurization (str) – The featurization method used to construct cellular neighborhoods.

  • topics (list) – A list of integers corresponding to the different number of possible topics to investigate.

  • silhouette (bool) – Whether or not the silhouette score should be computed. This metric can take some time to compute so it is False by default.

  • num_boots (int | None) – The number of bootstrap samples to use when calculating the Gap-statistic. If None, the Gap-statistic is not computed.

Returns:

A dictionary of dictionaries containing the corresponding metrics for each topic value provided.

Return type:

dict
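The selection rule described for the Gap Statistic in compute_topic_eda() can be sketched in a few lines. This is a minimal illustration, not code from ark: the function name select_num_topics and the two input dictionaries (mapping each candidate k to its gap value and scaled standard error) are hypothetical, and the sketch assumes the candidate values of k are consecutive integers.

```python
def select_num_topics(gaps, gap_sds):
    """Return the smallest k with Gap(k) > Gap(k+1) - s_{k+1}.

    gaps and gap_sds are hypothetical dicts mapping each candidate k
    to Gap(k) and s_k. Returns None when no candidate satisfies the rule.
    """
    ks = sorted(gaps)
    # Compare each candidate with the next one (assumed to be k + 1).
    for k, k_next in zip(ks, ks[1:]):
        if gaps[k] > gaps[k_next] - gap_sds[k_next]:
            return k
    return None


# Made-up values purely for illustration.
gaps = {2: 0.10, 3: 0.35, 4: 0.38, 5: 0.36}
gap_sds = {2: 0.02, 3: 0.02, 4: 0.04, 5: 0.03}
best_k = select_num_topics(gaps, gap_sds)
print(best_k)
```

With these illustrative numbers the rule fails at k = 2 and first holds at k = 3, since Gap(3) exceeds Gap(4) minus its standard error.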

ark.spLDA.processing.create_difference_matrices(cell_table, features, training=True, inference=True)

Constructs the difference matrices used for training and inference for each field of view in the formatted cell table.

Parameters:
  • cell_table (dict) – A formatted cell table for use in spatial-LDA analysis. Specifically, this is the output from format_cell_table().

  • features (dict) – A dictionary containing the featurized cell table and the training data. Specifically, this is the output from featurize_cell_table().

  • training (bool) – If True, create the difference matrix for running the training algorithm. One or both of training and inference must be True.

  • inference (bool) – If True, create the difference matrix for running the inference algorithm.

Returns:

A dictionary containing the difference matrices used for training and inference.

Return type:

dict

ark.spLDA.processing.featurize_cell_table(cell_table, featurization='cluster', radius=100, cell_index='is_index', n_processes=None, train_frac=0.75)

Calculates statistics for local cellular neighborhoods based on the specified features and radius.

Parameters:
  • cell_table (dict) – A formatted cell table for use in spatial-LDA analysis. Specifically, this is the output from format_cell_table().

  • featurization (str) – One of four featurization methods; defaults to “cluster” if not provided:

      • marker – for each marker, count the total number of cells within a radius r from cell i having marker expression greater than 0.5.

      • avg_marker – for each marker, compute the average marker expression of all cells within a radius r from cell i.

      • cluster – for each cell cluster, count the total number of cells within a radius r from cell i belonging to that cell cluster.

      • count – count the total number of cells within a radius r from cell i.

  • radius (int) – Size of the radius, in pixels, used to featurize cellular neighborhoods.

  • cell_index (str) – Name of the column containing the indexes of reference cells to be used in constructing local cellular neighborhoods. If not specified, all cells are used.

  • n_processes (int) – Number of parallel processes to use.

  • train_frac (float) – The fraction of cells from each field of view to be extracted as training data.

Returns:

A dictionary containing a DataFrame of featurized cellular neighborhoods and a separate DataFrame of designated training data. The dictionary also records the featurization method, for use in later functions.

Return type:

dict
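To make the “cluster” featurization of featurize_cell_table() concrete, the following sketch counts, for each cell, the neighbors of each cluster within a given radius. It is only an illustration of the idea under stated assumptions: the helper cluster_neighborhood_counts and the (x, y, cluster) tuple layout are hypothetical, and whether ark counts a cell in its own neighborhood is not stated in this reference (the sketch excludes it).

```python
from collections import Counter
from math import dist


def cluster_neighborhood_counts(cells, radius):
    """Count neighbors per cluster within `radius` of each cell.

    cells is a hypothetical list of (x, y, cluster) tuples with
    coordinates in pixels. Returns one Counter per cell.
    """
    features = []
    for i, (xi, yi, _) in enumerate(cells):
        counts = Counter()
        for j, (xj, yj, cluster_j) in enumerate(cells):
            # Assumption: a cell is not counted in its own neighborhood.
            if j != i and dist((xi, yi), (xj, yj)) <= radius:
                counts[cluster_j] += 1
        features.append(counts)
    return features


# Made-up coordinates purely for illustration.
cells = [
    (0, 0, "tumor"),
    (30, 40, "immune"),
    (60, 70, "tumor"),
    (500, 500, "immune"),
]
features = cluster_neighborhood_counts(cells, radius=100)
for cell, counts in zip(cells, features):
    print(cell, dict(counts))
```

The pairwise loop is quadratic in the number of cells and is meant only to show the definition; a real implementation would normally use a spatial index.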

ark.spLDA.processing.format_cell_table(cell_table, markers=None, clusters=None)

Formats a cell table containing one or more fields of view to be compatible with the spatial_lda library.

Parameters:
  • cell_table (pandas.DataFrame) – A DataFrame containing the columns of cell marker frequencies and/or cluster ids.

  • markers (list) – A list of strings corresponding to the markers in cell_table which will be used to train the spatial LDA model. Either markers or clusters must be provided.

  • clusters (list) – A list of cell cluster names in cell_table which will be used to train the spatial LDA model.

Returns:

A dictionary of formatted cell tables for use in spatial-LDA analysis. Each element in the dictionary is a DataFrame corresponding to a single field of view.

Return type:

dict

ark.spLDA.processing.fov_density(cell_table, total_pix=1048576)

Computes cellular density metrics for each field of view to determine an appropriate radius for the featurization step.

Parameters:
  • cell_table (dict) – A formatted cell table for use in spatial-LDA analysis. Specifically, this is the output from format_cell_table().

  • total_pix (int) – The total number of pixels in each field of view.

Returns:

A dictionary containing the average cell size, cellular density, and total cell count for each field of view. Cellular density is calculated as the total number of pixels occupied by cells divided by the total number of pixels in each field of view.

Return type:

dict
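The density calculation described for fov_density() amounts to simple arithmetic, sketched below for a single field of view. The helper density_metrics, its input (a list of per-cell areas in pixels), and the dictionary keys are hypothetical names chosen for illustration; only the default total_pix of 1048576 (that is, 1024 × 1024) comes from the signature above.

```python
def density_metrics(cell_sizes, total_pix=1048576):
    """Summarize one field of view from per-cell areas in pixels.

    cell_sizes is a hypothetical list of cell areas; the keys of the
    returned dict are illustrative, not ark's actual output keys.
    """
    total_cells = len(cell_sizes)
    occupied = sum(cell_sizes)  # pixels covered by cells
    return {
        "average_area": occupied / total_cells if total_cells else 0.0,
        "cellular_density": occupied / total_pix,
        "total_cells": total_cells,
    }


# Three made-up cells covering 1024 pixels of a 1024 x 1024 image.
metrics = density_metrics([200, 300, 524], total_pix=1024 * 1024)
print(metrics)
```

A density close to 0 indicates a sparse field of view, which may call for a larger featurization radius than a densely packed one.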

ark.spLDA.processing.gap_stat(features, k, clust_inertia, num_boots=25)

Computes the Gap-statistic for a given k-means clustering model as introduced by Tibshirani, Walther and Hastie (2001).

Parameters:
  • features (pandas.DataFrame) – A DataFrame of featurized cellular neighborhoods. Specifically, this is one of the outputs of featurize_cell_table().

  • k (int) – The number of clusters in the k-means model.

  • clust_inertia (float) – The calculated inertia from the k-means fit using the featurized data.

  • num_boots (int) – The number of bootstrap reference samples to generate.

Returns:

  • Estimated difference between the expected log within-cluster sum of squares and the observed log within-cluster sum of squares (a.k.a. the Gap-statistic).

  • A scaled estimate of the standard error of the expected log within-cluster sum of squares.

Return type:

tuple (float, float)
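The two values returned by gap_stat() can be illustrated with the definitions from Tibshirani, Walther and Hastie (2001). The sketch below is not ark's implementation: it skips generating the reference samples and fitting k-means to them, and instead takes the resulting reference inertias as a given list. The function name gap_from_inertias is hypothetical, and the use of the population standard deviation scaled by sqrt(1 + 1/B) follows the paper; whether ark matches this exactly is an assumption.

```python
import math
import statistics


def gap_from_inertias(clust_inertia, reference_inertias):
    """Return (gap, s) from an observed inertia and reference inertias.

    reference_inertias is a hypothetical list holding the k-means
    inertia obtained on each of B reference (null) samples.
    """
    log_refs = [math.log(w) for w in reference_inertias]
    num_boots = len(log_refs)
    # Gap(k) = mean of log W*_kb over reference samples - log W_k
    gap = statistics.fmean(log_refs) - math.log(clust_inertia)
    # s_k = sd_k * sqrt(1 + 1/B), with sd_k the population std. dev.
    s = statistics.pstdev(log_refs) * math.sqrt(1 + 1 / num_boots)
    return gap, s


# Made-up inertias purely for illustration.
gap, s = gap_from_inertias(100.0, [140.0, 150.0, 160.0, 155.0])
print(gap, s)
```

A positive gap means the observed clustering is tighter than would be expected under the null reference distribution.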