ark.analysis

ark.analysis.cell_neighborhood_stats

ark.analysis.cell_neighborhood_stats.calculate_mean_distance_to_all_cell_types(cell_table, dist_xr, k, cell_type_col='cell_meta_cluster', cell_label_col='label')[source]

Wrapper function to calculate mean distance of all cells against all cell types

Parameters:
  • cell_table (pd.DataFrame) – Dataframe containing all cells and their cell type

  • dist_xr (xarray.DataArray) – Cell by cell distances for all cells

  • k (int) – Number of nearest neighbours

  • cell_type_col (str) – column with the cell phenotype

  • cell_label_col (str) – column with the cell labels

Returns:

average distances

Return type:

pd.DataFrame

ark.analysis.cell_neighborhood_stats.calculate_mean_distance_to_cell_type(cell_table, dist_xr, cell_cluster, k, cell_type_col='cell_meta_cluster', cell_label_col='label')[source]

Function to calculate mean distance of all cells to a specified cell type

Parameters:
  • cell_table (pd.DataFrame) – Dataframe containing all cells and their cell type

  • dist_xr (xarray.DataArray) – Cell by cell distances for all cells

  • cell_cluster (str) – Cell cluster to calculate distance to

  • k (int) – Number of nearest neighbours

  • cell_type_col (str) – column with the cell phenotype

  • cell_label_col (str) – column with the cell labels

Returns:

mean distances for each cell to the cluster cells

Return type:

np.array
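
The idea of a mean distance to the k nearest cells of one type can be sketched in plain Python. This is only an illustration of the concept, not the ark implementation: the function name, the list-of-lists distance matrix, the exclusion of the cell itself, and the use of None when fewer than k target cells exist are all assumptions made for this example.

```python
def mean_distance_to_cell_type_sketch(dist, cell_types, target_type, k):
    """For each cell, average its distances to the k nearest cells of target_type.

    dist is a square list of lists of pairwise distances and cell_types[i] is
    the phenotype of cell i. Returns one value per cell, or None when fewer
    than k target cells are available (an assumed convention).
    """
    target_idx = [i for i, t in enumerate(cell_types) if t == target_type]
    means = []
    for i in range(len(cell_types)):
        # Exclude the cell itself so its zero self-distance does not bias the mean.
        candidates = sorted(dist[i][j] for j in target_idx if j != i)
        if len(candidates) < k:
            means.append(None)
        else:
            means.append(sum(candidates[:k]) / k)
    return means


# Hypothetical 4-cell example: one tumor cell and three immune cells.
dist = [
    [0.0, 2.0, 4.0, 6.0],
    [2.0, 0.0, 3.0, 5.0],
    [4.0, 3.0, 0.0, 1.0],
    [6.0, 5.0, 1.0, 0.0],
]
cell_types = ["tumor", "immune", "immune", "immune"]
result = mean_distance_to_cell_type_sketch(dist, cell_types, "immune", k=2)
print(result)  # [3.0, 4.0, 2.0, 3.0]
```

Larger values of k smooth the estimate but need more cells of the target type in each image.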

ark.analysis.cell_neighborhood_stats.compute_neighborhood_diversity(neighborhood_mat, cell_type_col)[source]

Generates a diversity score for each cell using the neighborhood matrix

Parameters:
  • neighborhood_mat (pd.DataFrame) – the frequency neighbors matrix

  • cell_type_col (string) – the specific name of the cell type column the matrix represents

Returns:

contains the fov, label, cell_type, and diversity_cell_type values for each cell

Return type:

pd.DataFrame

ark.analysis.cell_neighborhood_stats.generate_cell_distance_analysis(cell_table, dist_mat_dir, save_path, k, cell_type_col='cell_meta_cluster', fov_col='fov', cell_label_col='label')[source]

Creates a dataframe containing the average distance between a cell and other cells of each phenotype, based on the specified cell_type_col.

Parameters:
  • cell_table (pd.DataFrame) – dataframe containing all cells and their cell type

  • dist_mat_dir (str) – path to directory containing the distance matrix files

  • save_path (str) – path where to save the results to

  • k (int) – Number of nearest neighbours

  • fov_col (str) – column containing the image name

  • cell_type_col (str) – column with the cell phenotype

  • cell_label_col (str) – column with the cell labels

ark.analysis.cell_neighborhood_stats.generate_neighborhood_diversity_analysis(neighbors_mat_dir, pixel_radius, cell_type_columns)[source]

Generates a diversity score for each cell using the neighborhood matrix

Parameters:
  • neighbors_mat_dir (str) – directory containing the neighbors matrices

  • pixel_radius (int) – radius used to define the neighbors of each cell

  • cell_type_columns (list) – list of cell cluster columns to read in neighbors matrices for

Returns:

contains diversity data calculated at each specified cell cluster level

Return type:

pd.DataFrame

ark.analysis.cell_neighborhood_stats.shannon_diversity(proportions)[source]

Calculates the Shannon diversity index for the provided proportions of a community

Parameters:
  • proportions (np.array) – the proportions of each individual group

Returns:

the diversity of neighborhood

Return type:

float
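
For reference, the Shannon diversity index of proportions p_i is H = -Σ p_i ln(p_i). The sketch below is a standalone illustration of that formula, not the ark source: the function name is hypothetical, and the natural logarithm and the skipping of zero proportions are assumptions.

```python
import math


def shannon_diversity_sketch(proportions):
    """Shannon index H = -sum(p * ln p) over the non-zero proportions."""
    # Zero proportions contribute nothing, since p * ln(p) tends to 0 as p -> 0.
    return -sum(p * math.log(p) for p in proportions if p > 0)


even = shannon_diversity_sketch([0.25, 0.25, 0.25, 0.25])  # equals ln(4)
single = shannon_diversity_sketch([1.0, 0.0, 0.0, 0.0])    # equals 0
```

A neighborhood made of a single cell type scores 0, and the score is highest when all cell types are equally represented.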

ark.analysis.dimensionality_reduction

ark.analysis.dimensionality_reduction.plot_dim_reduced_data(component_one, component_two, fig_id, hue, cell_data, title, title_fontsize=24, palette='Spectral', alpha=0.3, legend_type='full', bbox_to_anchor=(1.05, 1), legend_loc=2, legend_borderaxespad=0.0, dpi=None, save_dir=None, save_file=None)[source]

Helper function to visualize_dimensionality_reduction

Parameters:
  • component_one (pandas.Series) – the data corresponding to the first component

  • component_two (pandas.Series) – the data corresponding to the second component

  • fig_id (int) – the figure identifier for the visualization

  • hue (pandas.Series) – define the hue for each data point

  • cell_data (pandas.DataFrame) – Dataframe containing columns for dimensionality reduction and category

  • title (str) – the title we wish to set for the graph

  • title_fontsize (int) – the fontsize of the title we want

  • palette (str) – the color palette we wish to visualize with

  • alpha (float) – a value to define the opacity of the points visualized

  • legend_type (str) – what type of legend we wish to specify

  • bbox_to_anchor (tuple) – the bounding box of the legend

  • legend_loc (str) – a string describing where we want the legend located

  • legend_borderaxespad (float) – the pad between the axes and legend border

  • dpi (float) – The resolution of the image to save, ignored if save_dir is None

  • save_dir (str) – Directory to save plots, default is None

  • save_file (str) – If save_dir specified, specify a file name you wish to save to. Ignored if save_dir is None

ark.analysis.dimensionality_reduction.visualize_dimensionality_reduction(cell_data, columns, category, color_map='Spectral', algorithm='UMAP', dpi=None, save_dir=None)[source]

Plots the dimensionality reduction of specified population columns

Parameters:
  • cell_data (pandas.DataFrame) – Dataframe containing columns for dimensionality reduction and category

  • columns (list) – List of column names that are included for dimensionality reduction

  • category (str) – Name of column in dataframe containing population or patient data

  • color_map (str) – Name of the Matplotlib colormap used

  • algorithm (str) – Name of dimensionality reduction algorithm, must be UMAP, PCA, or tSNE

  • dpi (float) – The resolution of the image to save, ignored if save_dir is None

  • save_dir (str) – Directory to save plots, default is None

ark.analysis.neighborhood_analysis

ark.analysis.neighborhood_analysis.compute_cell_ratios(neighbors_mat, target_cells, reference_cells, fov_list, bin_number=10, cell_col='cell_meta_cluster', fov_col='fov', label_col='label')[source]

Computes the target/reference and reference/target ratios for each FOV

Parameters:
  • neighbors_mat (pandas.DataFrame) – a neighborhood matrix, created from create_neighborhood_matrix

  • target_cells (list) – invading cell phenotypes

  • reference_cells (list) – expected cell phenotypes

  • fov_list (list) – names of the fovs to compare

  • bin_number (int) – number of bins to use in histogram

  • cell_col (str) – column with the cell phenotype

  • fov_col (str) – column with the fovs

  • label_col (str) – column with the cell labels

Returns:

  • the target/reference ratios of each FOV

  • the reference/target ratios of each FOV

Return type:

tuple(list, list)

ark.analysis.neighborhood_analysis.compute_cluster_metrics_inertia(neighbor_mat, min_k=2, max_k=10, seed=42, included_fovs=None, fov_col='fov', label_col='label', cell_col='cell_meta_cluster')[source]

Produce k-means clustering metrics to help identify optimal number of clusters using inertia

Parameters:
  • neighbor_mat (pandas.DataFrame) – a neighborhood matrix, created from create_neighborhood_matrix

  • min_k (int) – the minimum k we want to generate cluster statistics for, must be at least 2

  • max_k (int) – the maximum k we want to generate cluster statistics for, must be at least 2

  • seed (int) – the random seed to set for k-means clustering

  • included_fovs (list) – fovs to include in analysis. If argument is none, default is all fovs used.

  • fov_col (str) – the name of the column in neighbor_mat indicating the fov

  • label_col (str) – the name of the column in neighbor_mat indicating the label

  • cell_col (str) – column with the cell phenotype

Returns:

an xarray with dimensions (num_k_values), where num_k_values is the range of integers from min_k to max_k inclusive; contains the metric score for each value in num_k_values

Return type:

xarray.DataArray

ark.analysis.neighborhood_analysis.compute_cluster_metrics_silhouette(neighbor_mat, min_k=2, max_k=10, seed=42, included_fovs=None, fov_col='fov', label_col='label', cell_col='cell_meta_cluster', subsample=None)[source]

Produce k-means clustering metrics to help identify optimal number of clusters using Silhouette score

Parameters:
  • neighbor_mat (pandas.DataFrame) – a neighborhood matrix, created from create_neighborhood_matrix

  • min_k (int) – the minimum k we want to generate cluster statistics for, must be at least 2

  • max_k (int) – the maximum k we want to generate cluster statistics for, must be at least 2

  • seed (int) – the random seed to set for k-means clustering

  • included_fovs (list) – fovs to include in analysis. If argument is none, default is all fovs used.

  • fov_col (str) – the name of the column in neighbor_mat indicating the fov

  • label_col (str) – the name of the column in neighbor_mat indicating the label

  • cell_col (str) – column with the cell phenotype

  • subsample (int) – the number of cells that will be sampled from each neighborhood cluster for calculating Silhouette score. If None, all cells will be used

Returns:

an xarray with dimensions (num_k_values), where num_k_values is the range of integers from min_k to max_k inclusive; contains the metric score for each value in num_k_values

Return type:

xarray.DataArray

ark.analysis.neighborhood_analysis.compute_mixing_score(fov_neighbors_mat, target_cells, reference_cells, mixing_type, ratio_threshold=5, cell_count_thresh=200, cell_col='cell_meta_cluster', fov_col='fov', label_col='label')[source]

Compute and return the mixing score for the specified target/reference cell types

Parameters:
  • fov_neighbors_mat (pandas.DataFrame) – a neighborhood matrix, created from create_neighborhood_matrix and subsetted for 1 fov

  • target_cells (list) – invading cell phenotypes

  • reference_cells (list) – expected cell phenotypes

  • mixing_type (str) – “homogeneous” or “percent”, homogeneous is a symmetrical calculation

  • ratio_threshold (int) – maximum ratio of cell_types required to calculate a mixing score, under this labeled “cold”

  • cell_count_thresh (int) – minimum number of total cells from both populations to calculate a mixing score, under this labeled “cold”

  • cell_col (str) – column with the cell phenotype

  • fov_col (str) – column with the fovs

  • label_col (str) – column with the cell labels

Returns:

the mixing score for the FOV

Return type:

float

ark.analysis.neighborhood_analysis.create_neighborhood_matrix(all_data, dist_mat_dir, included_fovs=None, distlim=50, self_neighbor=False, fov_col='fov', cell_label_col='label', cell_type_col='cell_meta_cluster')[source]

Calculates the number of neighbor phenotypes for each cell.

Parameters:
  • all_data (pandas.DataFrame) – data for all fovs. Includes the columns for fov, label, and cell phenotype.

  • dist_mat_dir (str) – directory containing the distance matrices

  • included_fovs (list) – fovs to include in analysis. If argument is none, default is all fovs used.

  • distlim (int) – cell proximity threshold. Default is 50.

  • self_neighbor (bool) – If true, cell counts itself as a neighbor in the analysis. Default is False.

  • fov_col (str) – column with the cell fovs.

  • cell_label_col (str) – column with the cell labels.

  • cell_type_col (str) – column with the cell types.

Returns:

  • DataFrame containing phenotype counts per cell

  • DataFrame containing phenotype frequencies (counts per phenotype divided by total counts) for each cell

Return type:

tuple (pandas.DataFrame, pandas.DataFrame)

ark.analysis.neighborhood_analysis.generate_cluster_matrix_results(all_data, neighbor_mat, cluster_num, seed=42, excluded_channels=None, included_fovs=None, cluster_label_col='kmeans_neighborhood', fov_col='fov', cell_type_col='cell_meta_cluster', label_col='label', pre_channel_col='cell_size', post_channel_col='label')[source]

Generate the cluster info on all_data using k-means clustering on neighbor_mat.

cluster_num has to be picked based on visualizations from compute_cluster_metrics_inertia or compute_cluster_metrics_silhouette.

Parameters:
  • all_data (pandas.DataFrame) – data including fovs, cell labels, and cell expression matrix for all markers

  • neighbor_mat (pandas.DataFrame) – a neighborhood matrix, created from create_neighborhood_matrix

  • cluster_num (int) – the optimal k to pass into k-means clustering to generate the final clusters and corresponding results

  • seed (int) – the random seed to set for k-means clustering

  • excluded_channels (list) – all channel names to be excluded from analysis

  • included_fovs (list) – fovs to include in analysis. If argument is None, default is all fovs used

  • cluster_label_col (str) – the name of the cluster label col we will create for neighborhood clusters

  • fov_col (str) – the name of the column in all_data and neighbor_mat indicating the fov

  • cell_type_col (str) – the name of the column in all_data indicating the cell type

  • label_col (str) – the name of the column in all_data indicating cell label

  • pre_channel_col (str) – the name of the column in all_data right before the first channel column

  • post_channel_col (str) – the name of the column in all_data right after the last channel column

Returns:

  • the expression matrix with the corresponding cluster labels attached, will only include fovs included in the analysis

  • an a x b count matrix (a = # of clusters, b = # of cell types) with cluster ids indexed row-wise and cell types indexed column-wise, indicates number of cell types that are within each cluster

  • an a x c mean matrix (a = # of clusters, c = # of markers) with cluster ids indexed row-wise and markers indexed column-wise, indicates the mean marker expression for each cluster id

Return type:

tuple (pandas.DataFrame, pandas.DataFrame, pandas.DataFrame)

ark.analysis.spatial_analysis_utils

ark.analysis.spatial_analysis_utils.append_distance_features_to_dataset(fov, dist_matrix, cell_table, distance_columns)[source]

Appends selected distance features as ‘cells’ in distance matrix and cell table

Parameters:
  • fov (str) – the name of the FOV

  • dist_matrix (xarray.DataArray) – a cells x cells matrix with the Euclidean distance between centers of corresponding cells for the FOV

  • cell_table (pd.DataFrame) – Table of cell features. Must contain provided distance columns

  • distance_columns (List[str]) – List of column names which store feature distance. These must exist in cell_table

Returns:

Updated cell_table, and distance matrix indexed by fov name

Return type:

(pd.DataFrame, dict)

ark.analysis.spatial_analysis_utils.calc_dist_matrix(label_dir, save_path, prefix='_whole_cell')[source]

Generate matrix of distances between center of pairs of cells.

Saves each one individually to save_path.

Parameters:
  • label_dir (str) – path to segmentation masks indexed by (fov, cell_id, cell_id, label)

  • save_path (str) – path to save the distance matrices

  • prefix (str) – the prefix used to identify label map files in label_dir

ark.analysis.spatial_analysis_utils.calculate_enrichment_stats(close_num, close_num_rand)[source]

Calculates z score and p values from spatial enrichment analysis.

Parameters:
  • close_num (numpy.ndarray) – marker x marker matrix with counts for cells positive for corresponding markers

  • close_num_rand (numpy.ndarray) – random positive marker counts for every permutation in the bootstrap

Returns:

xarray containing the following statistics for marker to marker enrichment

  • z: z scores for corresponding markers

  • muhat: predicted mean values of close_num_rand random distribution

  • sigmahat: predicted standard deviations of close_num_rand random distribution

  • p: p values for corresponding markers, for both positive and negative enrichment

  • h: matrix indicating whether corresponding marker interactions are significant

  • adj_p: fdr_bh (Benjamini-Hochberg) adjusted p values

Return type:

xarray.DataArray
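
How the observed count relates to the permutation distribution can be illustrated for a single marker pair. This is a minimal sketch under stated assumptions, not the ark implementation: the function name is hypothetical, muhat and sigmahat are taken as the plain mean and population standard deviation of the random counts, and the add-one empirical p values are one common convention that the library may not share.

```python
import statistics


def enrichment_stats_sketch(observed, random_counts):
    """Z score and empirical p values for one marker pair.

    observed is the count of close interactions in the real data;
    random_counts holds the count from each permutation.
    """
    n = len(random_counts)
    muhat = statistics.fmean(random_counts)
    sigmahat = statistics.pstdev(random_counts)
    z = (observed - muhat) / sigmahat if sigmahat > 0 else float("nan")
    # Add-one correction so an empirical p value is never exactly zero.
    p_pos = (1 + sum(r >= observed for r in random_counts)) / (1 + n)
    p_neg = (1 + sum(r <= observed for r in random_counts)) / (1 + n)
    return {"z": z, "muhat": muhat, "sigmahat": sigmahat,
            "p_pos": p_pos, "p_neg": p_neg}


# Hypothetical counts: 30 observed close pairs versus six permutations.
stats = enrichment_stats_sketch(observed=30, random_counts=[10, 12, 8, 10, 14, 6])
```

A large positive z with a small p_pos suggests the two markers are found close together more often than chance; a large negative z with a small p_neg suggests avoidance.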

ark.analysis.spatial_analysis_utils.compute_close_cell_num(dist_mat, dist_lim, analysis_type, current_fov_data=None, current_fov_channel_data=None, cluster_ids=None, cell_types_analyze=None, thresh_vec=None, cell_label_col='label', cell_type_col='cell_num')[source]

Finds positive cell labels and creates matrix with counts for cells positive for corresponding markers. Computes close_num matrix for both Cell Label and Threshold spatial analyses.

This function loops through all the included markers in the patient data and identifies cell labels positive for corresponding markers. It then subsets the distance matrix to only include these positive cells and records interactions based on whether cells are close to each other (within the dist_lim). It then stores the number of interactions in the index of close_num corresponding to both markers (for instance markers 1 and 2 would be in index [0, 1]).

Parameters:
  • dist_mat (numpy.ndarray) – cells x cells matrix with the Euclidean distance between centers of corresponding cells

  • dist_lim (int) – threshold for spatial enrichment distance proximity

  • analysis_type (str) – type of analysis, must be either cluster or channel

  • current_fov_data (pandas.DataFrame) – data for specific patient in expression matrix

  • current_fov_channel_data (pandas.DataFrame) – data of only column markers for Channel Analysis

  • cluster_ids (numpy.ndarray) – all the cell phenotypes in Cluster Analysis

  • cell_types_analyze (list) – a list of the cell types we wish to analyze, if None we set it equal to all cell types

  • thresh_vec (numpy.ndarray) – matrix of thresholds column for markers

  • cell_label_col (str) – the name of the column containing the cell labels

  • cell_type_col (str) – the name of the column containing the cell type numbers

Returns:

2D array containing marker x marker matrix with counts for cells positive for corresponding markers, as well as a list of number of cell labels for marker 1

Return type:

numpy.ndarray

ark.analysis.spatial_analysis_utils.compute_close_cell_num_random(marker_nums, mark_pos_labels, dist_mat, dist_lim, bootstrap_num)[source]

Uses bootstrapping to permute cell labels randomly and records the number of close cells (within the dist_lim) in that random setup.

Parameters:
  • marker_nums (numpy.ndarray) – list of cell counts of each marker type

  • mark_pos_labels (list) – cell labels for each marker number

  • dist_mat (xr.DataArray) – cells x cells matrix with the Euclidean distance between centers of corresponding cells. This can be indexed by cell label

  • dist_lim (int) – threshold for spatial enrichment distance proximity

  • bootstrap_num (int) – number of permutations

Returns:

Large matrix of random positive marker counts for every permutation in the bootstrap

Return type:

numpy.ndarray
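
The permutation step can be pictured as repeatedly drawing random sets of cells of the same sizes as the two marker-positive populations and counting the close pairs. The sketch below illustrates that idea only; the function name, the sampling without replacement within each draw, and the exclusion of self-pairs are assumptions rather than documented ark behaviour.

```python
import random


def close_count_random_sketch(dist, n_first, n_second, dist_lim,
                              bootstrap_num, seed=0):
    """Null distribution of close-pair counts under random label assignment.

    Each permutation draws n_first and n_second cells at random and counts
    the pairs of distinct cells whose distance is below dist_lim.
    """
    rng = random.Random(seed)
    cells = range(len(dist))
    counts = []
    for _ in range(bootstrap_num):
        first = rng.sample(cells, n_first)
        second = rng.sample(cells, n_second)
        counts.append(sum(
            1 for i in first for j in second
            if i != j and dist[i][j] < dist_lim
        ))
    return counts


# Hypothetical 4-cell distance matrix with two tight pairs of cells.
dist = [
    [0, 5, 50, 60],
    [5, 0, 55, 65],
    [50, 55, 0, 4],
    [60, 65, 4, 0],
]
null = close_count_random_sketch(dist, n_first=2, n_second=2,
                                 dist_lim=10, bootstrap_num=100)
```

The resulting list plays the role of close_num_rand for one marker pair and can be compared against the observed count.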

ark.analysis.spatial_analysis_utils.compute_kmeans_inertia(neighbor_mat_data, min_k=2, max_k=10, seed=42)[source]

For a given neighborhood matrix, cluster and compute inertia using k-means clustering from the range of k=min_k to max_k

Parameters:
  • neighbor_mat_data (pandas.DataFrame) – neighborhood matrix data with only the desired fovs

  • min_k (int) – the minimum k we want to generate cluster statistics for, must be at least 2

  • max_k (int) – the maximum k we want to generate cluster statistics for, must be at least 2

  • seed (int) – the random seed to set for k-means clustering

Returns:

contains a single dimension, cluster_num, which indicates the inertia when cluster_num was set as k for k-means clustering

Return type:

xarray.DataArray
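
Inertia is the sum of squared distances from each point to its nearest cluster centroid, and plotting it against k is the usual "elbow" heuristic. The sketch below uses a bare-bones Lloyd's algorithm in plain Python to show what is being measured. It is not the ark implementation; the function name, the initialisation from randomly sampled points, and the dict return value (instead of an xarray.DataArray) are assumptions for illustration.

```python
import random


def kmeans_inertia_sketch(points, k, seed=42, n_iter=50):
    """Run a basic Lloyd's k-means and return the inertia."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)

    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    for _ in range(n_iter):
        # Assignment step: attach every point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: sq_dist(p, centroids[c]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its members,
        # keeping the old centroid if a cluster ends up empty.
        new_centroids = [
            tuple(sum(dim) / len(members) for dim in zip(*members))
            if members else centroids[c]
            for c, members in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return sum(min(sq_dist(p, c) for c in centroids) for p in points)


# Hypothetical data: two well-separated pairs of points.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
inertia_by_k = {k: kmeans_inertia_sketch(points, k) for k in range(2, 4)}
```

Inertia tends to fall as k grows, so the value of interest is usually the k after which the decrease levels off.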

ark.analysis.spatial_analysis_utils.compute_kmeans_silhouette(neighbor_mat_data, min_k=2, max_k=10, seed=42, subsample=None)[source]

For a given neighborhood matrix, cluster and compute Silhouette score using k-means from the range of k=min_k to max_k

Parameters:
  • neighbor_mat_data (pandas.DataFrame) – neighborhood matrix data with only the desired fovs

  • min_k (int) – the minimum k we want to generate cluster statistics for, must be at least 2

  • max_k (int) – the maximum k we want to generate cluster statistics for, must be at least 2

  • seed (int) – the random seed to set for k-means clustering

  • subsample (int) – the number of cells that will be sampled from each neighborhood cluster for calculating Silhouette score. If None, all cells will be used

Returns:

contains a single dimension, cluster_num, which indicates the Silhouette score when cluster_num was set as k for k-means clustering

Return type:

xarray.DataArray

ark.analysis.spatial_analysis_utils.compute_neighbor_counts(current_fov_neighborhood_data, dist_matrix, distlim, self_neighbor=False, cell_label_col='label', cluster_name_col='cell_meta_cluster')[source]

Calculates the number of neighbor phenotypes for each cell. The cell counts itself as a neighbor if self_neighbor=True.

Parameters:
  • current_fov_neighborhood_data (pandas.DataFrame) – data for the current fov, including the cell labels and cell phenotypes

  • dist_matrix (numpy.ndarray) – cells x cells matrix with the Euclidean distance between centers of corresponding cells

  • distlim (int) – threshold for distance proximity

  • self_neighbor (bool) – If true, cell counts itself as a neighbor in the analysis.

  • cell_label_col (str) – Column name with the cell labels

  • cluster_name_col (str) – Column name with the cell types

Returns:

  • phenotype counts per cell

  • phenotype frequencies of counts per total for each cell

Return type:

tuple (pandas.DataFrame, pandas.DataFrame)
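
The counts and frequencies described above can be illustrated with a small plain-Python sketch. It is not the ark implementation: the function name, the strict less-than comparison against distlim, and the all-zero frequencies for a cell without neighbors are assumptions made for this example.

```python
from collections import Counter


def neighbor_counts_sketch(dist, cell_types, distlim, self_neighbor=False):
    """Per-cell counts and frequencies of neighboring phenotypes within distlim."""
    phenotypes = sorted(set(cell_types))
    counts, freqs = [], []
    for i in range(len(cell_types)):
        tally = Counter(
            cell_types[j]
            for j in range(len(cell_types))
            if dist[i][j] < distlim and (self_neighbor or j != i)
        )
        total = sum(tally.values())
        counts.append({p: tally[p] for p in phenotypes})
        # A cell with no neighbors gets all-zero frequencies instead of
        # triggering a division by zero.
        freqs.append({p: (tally[p] / total if total else 0.0)
                      for p in phenotypes})
    return counts, freqs


# Hypothetical 3-cell example with phenotypes "A" and "B".
dist = [
    [0, 10, 80],
    [10, 0, 30],
    [80, 30, 0],
]
cell_types = ["A", "B", "B"]
counts, freqs = neighbor_counts_sketch(dist, cell_types, distlim=50)
```

Each row of the frequency output sums to 1 for cells with at least one neighbor, which makes cells with different neighbor totals comparable.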

ark.analysis.spatial_analysis_utils.generate_cluster_labels(neighbor_mat_data, cluster_num, seed=42)[source]

Run k-means clustering with k=cluster_num

Given the same data, repeated runs will always produce the same clusters, but the labels assigned will likely be different

Parameters:
  • neighbor_mat_data (pandas.DataFrame) – neighborhood matrix data with only the desired fovs

  • cluster_num (int) – the k we want to use when running k-means clustering

  • seed (int) – the random seed to set for k-means clustering

Returns:

the neighborhood cluster labels assigned to each cell in neighbor_mat_data

Return type:

numpy.ndarray

ark.analysis.spatial_analysis_utils.get_pos_cell_labels_channel(thresh, current_fov_channel_data, cell_labels, current_marker)[source]

For channel enrichment, identifies cells with positive expression values for the current marker (greater than the marker threshold).

Parameters:
  • thresh (int) – current threshold for marker

  • current_fov_channel_data (pandas.DataFrame) – expression data for column markers for current patient

  • cell_labels (pandas.DataFrame) – the column of cell labels for current patient

  • current_marker (str) – the current marker that the positive labels are being found for

Returns:

List of all the positive labels

Return type:

list
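
The thresholding described above amounts to keeping the labels of cells whose expression for the marker exceeds the threshold. A minimal sketch with plain lists, where the function name and inputs are hypothetical stand-ins for the DataFrame columns:

```python
def positive_labels_sketch(expression, labels, thresh):
    """Return the labels of cells whose expression is greater than thresh."""
    return [label for label, value in zip(labels, expression) if value > thresh]


# Hypothetical expression values for one marker across four cells.
labels = [1, 2, 3, 4]
expression = [0.1, 0.8, 0.5, 0.9]
positive = positive_labels_sketch(expression, labels, thresh=0.5)
print(positive)  # [2, 4]
```

The comparison is strict, so a cell sitting exactly at the threshold is not counted as positive in this sketch.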

ark.analysis.spatial_analysis_utils.get_pos_cell_labels_cluster(pheno, current_fov_neighborhood_data, cell_label_col, cell_type_col)[source]

For cluster enrichment, finds positive labels that match the current phenotype.

Parameters:
  • pheno (str) – the current cell phenotype

  • current_fov_neighborhood_data (pandas.DataFrame) – data for the current patient

  • cell_label_col (str) – the name of the column indicating the cell label

  • cell_type_col (str) – the name of the column indicating the cell type

Returns:

List of all the positive labels

Return type:

list

ark.analysis.spatial_enrichment

ark.analysis.spatial_enrichment.calculate_channel_spatial_enrichment(fov, dist_matrix, marker_thresholds, all_data, excluded_channels=None, dist_lim=100, bootstrap_num=100, fov_col='fov', cell_label_col='label', context_col=None)[source]

Spatial enrichment analysis to find significant interactions between cells expressing different markers. Uses bootstrapping to permute cell labels randomly.

Parameters:
  • fov (str) – the name of the FOV

  • dist_matrix (xarray.DataArray) – a cells x cells matrix with the Euclidean distance between centers of corresponding cells for the FOV

  • marker_thresholds (pd.DataFrame) – threshold values for positive marker expression

  • all_data (pandas.DataFrame) – data including fovs, cell labels, and cell expression matrix for all markers

  • excluded_channels (list) – channels to be excluded from the analysis. Default is None.

  • dist_lim (int) – cell proximity threshold. Default is 100.

  • bootstrap_num (int) – number of permutations for bootstrap. Default is 100.

  • fov_col (str) – column with the cell fovs.

  • cell_label_col (str) – cell label column name.

  • context_col (str) – column with context label.

Returns:

  • a tuple of closenum and closenumrand for the fov computed in the analysis

  • an xarray with dimensions (fovs, stats, num_channels, num_channels). The included stats variables for each fov are z, muhat, sigmahat, p, h, adj_p, and cluster_names

Return type:

tuple (tuple, xarray.DataArray)

ark.analysis.spatial_enrichment.calculate_cluster_spatial_enrichment(fov, all_data, dist_matrix, included_fovs=None, bootstrap_num=100, dist_lim=100, fov_col='fov', cluster_name_col='cell_meta_cluster', cluster_id_col='cell_num', cell_label_col='label', context_col=None, distance_cols=None)[source]

Spatial enrichment analysis based on cell phenotypes to find significant interactions between different cell types, looking for both positive and negative enrichment. Uses bootstrapping to permute cell labels randomly.

Parameters:
  • fov (str) – the name of the FOV

  • all_data (pandas.DataFrame) – data including fovs, cell labels, and cell expression matrix for all markers

  • dist_matrix (xarray.DataArray) – a cells x cells matrix with the Euclidean distance between centers of corresponding cells for the FOV

  • included_fovs (list) – patient labels to include in analysis. If argument is none, default is all labels used

  • bootstrap_num (int) – number of permutations for bootstrap. Default is 100

  • dist_lim (int) – cell proximity threshold. Default is 100

  • fov_col (str) – column with the cell fovs.

  • cluster_name_col (str) – column with the cell types.

  • cluster_id_col (str) – column with the cell phenotype number.

  • cell_label_col (str) – column with the cell labels.

  • context_col (str) – column with context labels. If None, no context is assumed.

  • distance_cols (str) – column names of feature distances to include in analysis.

Returns:

  • a tuple of closenum and closenumrand for the fov computed in the analysis

  • an xarray with dimensions (fovs, stats, number of channels, number of channels). The included stats variables for each fov are: z, muhat, sigmahat, p, h, adj_p, and cluster_names

Return type:

tuple (tuple, xarray.DataArray)

ark.analysis.spatial_enrichment.generate_channel_spatial_enrichment_stats(label_dir, dist_mat_dir, marker_thresholds, all_data, suffix='_whole_cell', xr_channel_name='label', **kwargs)[source]

Wrapper function for batching calls to calculate_channel_spatial_enrichment over fovs

Parameters:
  • label_dir (str | Pathlike) – directory containing labeled tiffs

  • dist_mat_dir (str | Pathlike) – directory containing the distance matrices

  • marker_thresholds (pd.DataFrame) – threshold values for positive marker expression

  • all_data (pandas.DataFrame) – data including fovs, cell labels, and cell expression matrix for all markers

  • suffix (str) – suffix for tiff file names

  • xr_channel_name (str) – channel name for label data array

  • **kwargs (dict) – args passed to calculate_channel_spatial_enrichment

Returns:

  • a list with each element consisting of a tuple of closenum and closenumrand for each fov included in the analysis

  • an xarray with dimensions (fovs, stats, num_channels, num_channels). The included stats variables for each fov are z, muhat, sigmahat, p, h, adj_p, and cluster_names

Return type:

tuple (list, xarray.DataArray)

ark.analysis.spatial_enrichment.generate_cluster_spatial_enrichment_stats(label_dir, dist_mat_dir, all_data, suffix='_whole_cell', xr_channel_name='label', **kwargs)[source]

Wrapper function for batching calls to calculate_cluster_spatial_enrichment over fovs

Parameters:
  • label_dir (str | Pathlike) – directory containing labeled tiffs

  • dist_mat_dir (str | Pathlike) – directory containing the distance matrices

  • all_data (pandas.DataFrame) – data including fovs, cell labels, and cell expression matrix for all markers

  • suffix (str) – suffix for tiff file names

  • xr_channel_name (str) – channel name for label data array

  • **kwargs (dict) – args passed to calculate_cluster_spatial_enrichment

Returns:

  • a list with each element consisting of a tuple of closenum and closenumrand for each fov included in the analysis

  • an xarray with dimensions (fovs, stats, num_channels, num_channels). The included stats variables for each fov are z, muhat, sigmahat, p, h, adj_p, and cluster_names

Return type:

tuple (list, xarray.DataArray)

ark.analysis.visualize

ark.analysis.visualize.draw_boxplot(cell_data, col_name, col_split=None, split_vals=None, dpi=None, save_dir=None, save_file=None)[source]

Draws a boxplot for a given column, optionally with help from a split column

Parameters:
  • cell_data (pandas.DataFrame) – Dataframe containing columns with Patient ID and Cell Name

  • col_name (str) – Name of the column we wish to draw a box-and-whisker plot for

  • col_split (str) – If specified, used for additional box-and-whisker plot faceting

  • split_vals (list) – If specified, only visualize the specified values in the col_split column

  • dpi (float) – The resolution of the image to save, ignored if save_dir is None

  • save_dir (str) – If specified, a directory where we will save the plot

  • save_file (str) – If save_dir specified, specify a file name you wish to save to. Ignored if save_dir is None

ark.analysis.visualize.draw_heatmap(data, x_labels, y_labels, dpi=None, center_val=None, min_val=None, max_val=None, cbar_ticks=None, colormap='vlag', row_colors=None, row_cluster=True, col_colors=None, col_cluster=True, left_start=None, right_start=None, w_spacing=None, h_spacing=None, save_dir=None, save_file=None)[source]

Plots the z scores between all phenotypes as a clustermap.

Parameters:
  • data (numpy.ndarray) – The data array to visualize

  • x_labels (list) – List of names displayed on horizontal axis

  • y_labels (list) – List of all names displayed on vertical axis

  • dpi (float) – The resolution of the image to save, ignored if save_dir is None

  • center_val (float) – value at which to center the heatmap

  • min_val (float) – minimum value the heatmap should take

  • max_val (float) – maximum value the heatmap should take

  • cbar_ticks (int) – list of values containing tick labels for the heatmap colorbar

  • colormap (str) – color scheme for visualization

  • row_colors (list) – Include these values as an additional color-coded cluster bar for row values

  • row_cluster (bool) – Whether to include dendrogram clustering for the rows

  • col_colors (list) – Include these values as an additional color-coded cluster bar for column values

  • col_cluster (bool) – Whether to include dendrogram clustering for the columns

  • left_start (float) – The position to set the left edge of the figure to (from 0-1)

  • right_start (float) – The position to set the right edge of the figure to (from 0-1)

  • w_spacing (float) – The amount of spacing to put between the subplots width-wise (from 0-1)

  • h_spacing (float) – The amount of spacing to put between the subplots height-wise (from 0-1)

  • save_dir (str) – If specified, a directory where we will save the plot

  • save_file (str) – If save_dir specified, specify a file name you wish to save to. Ignored if save_dir is None
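To illustrate what center_val, min_val, max_val, and cbar_ticks control, the sketch below centers a diverging colormap on a chosen value using plain Matplotlib. It is an assumption-laden stand-in: it draws an ordinary heatmap with `imshow` rather than a clustermap, performs no row or column clustering, and uses the `coolwarm` colormap in place of `vlag`. The phenotype labels and z scores are made up.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
from matplotlib.colors import TwoSlopeNorm

# Made-up z scores between three hypothetical phenotypes
labels = ["CD4T", "CD8T", "Bcell"]
rng = np.random.default_rng(0)
z_scores = rng.normal(size=(len(labels), len(labels)))

# center_val -> vcenter, min_val -> vmin, max_val -> vmax
norm = TwoSlopeNorm(vcenter=0.0, vmin=-3.0, vmax=3.0)

fig, ax = plt.subplots()
image = ax.imshow(z_scores, cmap="coolwarm", norm=norm)

# x_labels go on the horizontal axis, y_labels on the vertical axis
ax.set_xticks(range(len(labels)))
ax.set_xticklabels(labels)
ax.set_yticks(range(len(labels)))
ax.set_yticklabels(labels)

# cbar_ticks -> explicit tick positions on the colorbar
colorbar = fig.colorbar(image, ax=ax, ticks=[-3, 0, 3])
```

With this normalization, the center value always maps to the midpoint of the colormap, even when the minimum and maximum are not symmetric around it.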

ark.analysis.visualize.get_sorted_data(cell_data, sort_by_first, sort_by_second, is_normalized=False)[source]

Takes the cell data and generates a new sorted DataFrame in which each row represents a patient and each column represents a population category.

Parameters:
  • cell_data (pandas.DataFrame) – Dataframe containing columns with Patient ID and Cell Name

  • sort_by_first (str) – The first attribute we will be sorting our data by

  • sort_by_second (str) – The second attribute we will be sorting our data by

  • is_normalized (bool) – Boolean specifying whether to normalize cell counts or not, default is False

Returns:

DataFrame with rows and columns sorted by population

Return type:

pandas.DataFrame
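A table of this shape can be built with a cross-tabulation. The sketch below is one plausible reading of the description, not the library's code: the helper name `sketch_sorted_data`, the choice to order rows and columns by descending cell count, and the sample column names are all assumptions.

```python
import pandas as pd


def sketch_sorted_data(cell_data, sort_by_first, sort_by_second, is_normalized=False):
    """Hypothetical stand-in: patient-by-population table, largest groups first."""
    # Count cells per (patient, population); optionally convert each row to fractions
    table = pd.crosstab(
        cell_data[sort_by_first],
        cell_data[sort_by_second],
        normalize="index" if is_normalized else False,
    )
    # Assumed ordering: value_counts sorts by descending frequency
    row_order = cell_data[sort_by_first].value_counts().index
    col_order = cell_data[sort_by_second].value_counts().index
    return table.reindex(index=row_order, columns=col_order)


# Made-up data for illustration only
cells = pd.DataFrame({
    "patient": ["A", "A", "A", "A", "B", "B", "C"],
    "cell_type": ["T", "T", "T", "B_cell", "T", "B_cell", "T"],
})
counts = sketch_sorted_data(cells, "patient", "cell_type")
fractions = sketch_sorted_data(cells, "patient", "cell_type", is_normalized=True)
```

With is_normalized=True each row sums to 1, so patients with very different cell totals can be compared on the same scale.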

ark.analysis.visualize.plot_barchart(data, title, x_label, y_label, color_map='jet', is_stacked=True, is_legend=True, legend_loc='center left', bbox_to_anchor=(1.0, 0.5), dpi=None, save_dir=None, save_file=None)[source]

A helper function for visualize_patient_population_distribution.

Parameters:
  • data (pandas.DataFrame) – The data we wish to visualize

  • title (str) – The title of the graph

  • x_label (str) – The label on the x-axis

  • y_label (str) – The label on the y-axis

  • color_map (str) – The name of the Matplotlib colormap used

  • is_stacked (bool) – Whether we want a stacked barchart or not

  • is_legend (bool) – Whether we want a legend or not

  • legend_loc (str) – If is_legend is set, specify where we want the legend to be. Ignored if is_legend is False

  • bbox_to_anchor (tuple) – If is_legend is set, specify the bounding box of the legend. Ignored if is_legend is False

  • dpi (float) – The resolution of the image to save, ignored if save_dir is None

  • save_dir (str) – Directory to save plots, default is None

  • save_file (str) – If save_dir specified, specify a file name you wish to save to. Ignored if save_dir is None
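The plotting parameters above map closely onto pandas' built-in bar plotting. The following is a minimal sketch under that assumption; `sketch_barchart` is a hypothetical name, saving to disk is omitted, and the sample table is made up.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import pandas as pd


def sketch_barchart(data, title, x_label, y_label, color_map="jet",
                    is_stacked=True, is_legend=True,
                    legend_loc="center left", bbox_to_anchor=(1.0, 0.5)):
    """Hypothetical stand-in for a stacked bar chart with an outside legend."""
    # One bar per row of `data`, one colored segment per column
    ax = data.plot.bar(stacked=is_stacked, colormap=color_map, legend=False)
    ax.set_title(title)
    ax.set_xlabel(x_label)
    ax.set_ylabel(y_label)
    if is_legend:
        # The default arguments place the legend just outside the right edge
        ax.legend(loc=legend_loc, bbox_to_anchor=bbox_to_anchor)
    return ax


# Made-up patient-by-population counts
table = pd.DataFrame(
    {"T": [3, 1, 1], "B_cell": [1, 1, 0]},
    index=["A", "B", "C"],
)
ax = sketch_barchart(table, "Cell counts", "Patient", "Count")
```

The default legend_loc and bbox_to_anchor values anchor the legend's left-center point at the right edge of the axes, which keeps it from covering the bars.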

ark.analysis.visualize.visualize_fov_graphs(cell_table, features, diff_mats, fovs, dpi=None, save_dir=None)[source]

Visualize the adjacency graph used to define neighboring environments in each field of view.

Parameters:
  • cell_table (dict) – A formatted cell table for use in spatial-LDA analysis. Specifically, this is the output from format_cell_table().

  • features (dict) – A featurized cell table. Specifically, this is the output from featurize_cell_table().

  • diff_mats (dict) – The difference matrices produced by create_difference_matrices().

  • fovs (list) – A list of field of view IDs to plot.

  • dpi (float) – The resolution of the image to save, ignored if save_dir is None.

  • save_dir (str) – Directory to save plots, default is None

ark.analysis.visualize.visualize_fov_stats(data, metric='cellular_density', dpi=None, save_dir=None)[source]

Visualize area and cell count distributions for all fields of view.

Parameters:
  • data (dict) – The dictionary of field of view metrics produced by fov_density().

  • metric (str) – One of “cellular_density”, “average_area”, or “total_cells”. See documentation of fov_density() for details.

  • dpi (float) – The resolution of the image to save, ignored if save_dir is None

  • save_dir (str) – Directory to save plots, default is None

ark.analysis.visualize.visualize_neighbor_cluster_metrics(neighbor_cluster_stats, metric_name, dpi=None, save_dir=None)[source]

Visualize the cluster performance results of a neighborhood matrix

Parameters:
  • neighbor_cluster_stats (xarray.DataArray) – Contains the statistic we wish to visualize. Should have one coordinate called cluster_num, labeled starting from 2

  • metric_name (str) – name of metric

  • dpi (float) – The resolution of the image to save, ignored if save_dir is None

  • save_dir (str) – Directory to save plots, default is None

ark.analysis.visualize.visualize_patient_population_distribution(cell_data, patient_col_name, population_col_name, color_map='jet', show_total_count=True, show_distribution=True, show_proportion=True, dpi=None, save_dir=None)[source]

Plots the distribution of the population given by total count, direct count, and proportion

Parameters:
  • cell_data (pandas.DataFrame) – Dataframe containing columns with Patient ID and Cell Name

  • patient_col_name (str) – Name of column containing categorical Patient data

  • population_col_name (str) – Name of column in dataframe containing Population data

  • color_map (str) – Name of the Matplotlib colormap used. Default is jet

  • show_total_count (bool) – Boolean specifying whether to show graph of total population count, default is True

  • show_distribution (bool) – Boolean specifying whether to show graph of population distribution, default is True

  • show_proportion (bool) – Boolean specifying whether to show graph of population proportion, default is True

  • dpi (float) – The resolution of the image to save, ignored if save_dir is None

  • save_dir (str) – Directory to save plots, default is None
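The three show_* flags correspond to three different summaries of the same table. The sketch below computes those summaries with pandas only, without plotting. It assumes the total count is taken over all patients and that the proportion is computed within each patient; the column names `patient_id` and `population` are hypothetical.

```python
import pandas as pd

# Made-up data for illustration only
cell_data = pd.DataFrame({
    "patient_id": ["P1", "P1", "P1", "P2", "P2", "P3"],
    "population": ["tumor", "tumor", "immune", "tumor", "immune", "immune"],
})

# show_total_count: cells per population, pooled over all patients
total_count = cell_data["population"].value_counts()

# show_distribution: cells per population, split by patient
distribution = pd.crosstab(cell_data["patient_id"], cell_data["population"])

# show_proportion: the same table with each patient's row scaled to sum to 1
proportion = pd.crosstab(
    cell_data["patient_id"], cell_data["population"], normalize="index"
)
```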

ark.analysis.visualize.visualize_topic_eda(data, metric='gap_stat', gap_sd=True, k=None, transpose=False, scale=0.5, dpi=None, save_dir=None)[source]

Visualize the exploratory metrics for spatial-LDA topics

Parameters:
  • data (dict) – The dictionary of exploratory metrics produced by compute_topic_eda().

  • metric (str) – One of “gap_stat”, “inertia”, “silhouette”, or “cell_counts”.

  • gap_sd (bool) – If True, the standard error of the gap statistic is included in the plot.

  • k (int) – References a specific KMeans clustering with k clusters for visualizing the cell count heatmap.

  • transpose (bool) – Swap axes for cell_counts heatmap

  • scale (float) – Plot size scaling for cell_counts heatmap

  • dpi (float) – The resolution of the image to save, ignored if save_dir is None

  • save_dir (str) – Directory to save plots, default is None