ark.analysis

ark.analysis.cell_neighborhood_stats

ark.analysis.cell_neighborhood_stats.calculate_mean_distance_to_all_cell_types(cell_table, dist_xr, k, cell_type_col='cell_meta_cluster', cell_label_col='label')

Wrapper function to calculate the mean distance of all cells against all cell types

Parameters:

cell_table (pd.DataFrame) – Dataframe containing all cells and their cell type
dist_xr (xr.DataArray) – Cell by cell distances for all cells
k (int) – Number of nearest neighbours
cell_type_col (str) – column with the cell phenotype
cell_label_col (str) – column with the cell labels

Returns:

average distances

Return type:

pd.DataFrame

ark.analysis.cell_neighborhood_stats.calculate_mean_distance_to_cell_type(cell_table, dist_xr, cell_cluster, k, cell_type_col='cell_meta_cluster', cell_label_col='label')

Function to calculate the mean distance of all cells to a specified cell type

Parameters:

cell_table (pd.DataFrame) – Dataframe containing all cells and their cell type
dist_xr (xr.DataArray) – Cell by cell distances for all cells
cell_cluster (str) – Cell cluster to calculate distance to
k (int) – Number of nearest neighbours
cell_type_col (str) – column with the cell phenotype
cell_label_col (str) – column with the cell labels

Returns:

mean distances for each cell to the cluster cells

Return type:

np.array
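
The idea of a mean distance to the k nearest cells of one type can be sketched with pandas and NumPy alone. This is a minimal illustration, not the ark implementation: the function name mean_knn_distance_to_type is hypothetical, a labelled pandas DataFrame stands in for the xarray distance matrix, and the handling of self-distances (ignored) and of cells with fewer than k neighbours (NaN) are assumptions.

```python
import numpy as np
import pandas as pd


def mean_knn_distance_to_type(cell_table, dist, cell_cluster, k,
                              cell_type_col="cell_meta_cluster",
                              cell_label_col="label"):
    """Mean distance from every cell to its k nearest cells of one type.

    `dist` is a square DataFrame of pairwise distances whose index and
    columns are cell labels (hypothetical stand-in for the xarray input).
    """
    # Labels of the cells that belong to the requested cluster
    is_target = cell_table[cell_type_col] == cell_cluster
    target_labels = cell_table.loc[is_target, cell_label_col].tolist()
    all_labels = cell_table[cell_label_col].tolist()
    # Rows: every cell; columns: only cells of the requested cluster
    sub = dist.loc[all_labels, target_labels].to_numpy(dtype=float, copy=True)
    # Assumption: a zero distance is the cell itself and is ignored
    sub[sub == 0] = np.nan
    # Sort each row so the k smallest distances come first (NaN sorts last)
    nearest = np.sort(sub, axis=1)[:, :k]
    # Assumption: cells with fewer than k valid neighbours get NaN
    enough = (~np.isnan(nearest)).sum(axis=1) >= k
    out = np.full(sub.shape[0], np.nan)
    out[enough] = nearest[enough].mean(axis=1)
    return out


# Four cells on a line; cells 2 and 3 are of type "B"
positions = np.array([0.0, 1.0, 3.0, 6.0])
labels = [1, 2, 3, 4]
cells = pd.DataFrame({"label": labels,
                      "cell_meta_cluster": ["A", "B", "B", "A"]})
dist = pd.DataFrame(np.abs(positions[:, None] - positions[None, :]),
                    index=labels, columns=labels)
mean_to_b = mean_knn_distance_to_type(cells, dist, "B", k=1)
```

With k=1 each cell gets the distance to its single closest "B" cell; raising k to 2 leaves the two "B" cells with NaN, because each of them has only one other "B" cell available.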

ark.analysis.cell_neighborhood_stats.compute_neighborhood_diversity(neighborhood_mat, cell_type_col)

Generates a diversity score for each cell using the neighborhood matrix

Parameters:

neighborhood_mat (pd.DataFrame) – the frequency neighbors matrix
cell_type_col (str) – the specific name of the cell type column the matrix represents

Returns:

contains the fov, label, cell_type, and diversity_cell_type values for each cell

Return type:

pd.DataFrame

ark.analysis.cell_neighborhood_stats.generate_cell_distance_analysis(cell_table, dist_mat_dir, save_path, k, cell_type_col='cell_meta_cluster', fov_col='fov', cell_label_col='label')

Creates a dataframe containing the average distance between a cell and other cells of each phenotype, based on the specified cell_type_col.

Parameters:

cell_table (pd.DataFrame) – dataframe containing all cells and their cell type
dist_mat_dir (str) – path to directory containing the distance matrix files
save_path (str) – path to save the results to
k (int) – Number of nearest neighbours
fov_col (str) – column containing the image name
cell_type_col (str) – column with the cell phenotype
cell_label_col (str) – column with the cell labels

ark.analysis.cell_neighborhood_stats.generate_neighborhood_diversity_analysis(neighbors_mat_dir, pixel_radius, cell_type_columns)

Generates a diversity score for each cell using the neighborhood matrix

Parameters:

neighbors_mat_dir (str) – directory containing the neighbors matrices
pixel_radius (int) – radius used to define the neighbors of each cell
cell_type_columns (list) – list of cell cluster columns to read in neighbors matrices for

Returns:

contains diversity data calculated at each specified cell cluster level

Return type:

pd.DataFrame

ark.analysis.cell_neighborhood_stats.shannon_diversity(proportions)

Calculates the Shannon diversity index for the provided proportions of a community

Parameters:

proportions (np.array) – the proportions of each individual group

Returns:

the diversity of the neighborhood

Return type:

float
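
For reference, the Shannon diversity index of a set of proportions p_i is -sum(p_i * log(p_i)), with empty groups contributing nothing. The sketch below is a standalone illustration rather than the library code: the name shannon_index is hypothetical, and the logarithm base (2 here) is an assumption, since this page does not state which base ark uses.

```python
import numpy as np


def shannon_index(proportions, base=2):
    """Shannon diversity of a vector of proportions that sums to 1."""
    p = np.asarray(proportions, dtype=float)
    # Groups with zero proportion contribute nothing (0 * log 0 := 0)
    p = p[p > 0]
    # Change of base: log_b(x) = ln(x) / ln(b); base 2 is an assumption
    return float(-np.sum(p * np.log(p)) / np.log(base))


uniform = shannon_index([0.25, 0.25, 0.25, 0.25])  # four equally common types
single = shannon_index([1.0, 0.0, 0.0])            # only one type present
```

A neighborhood dominated by a single cell type scores zero, and the score grows as the neighboring cell types become more evenly mixed.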

ark.analysis.dimensionality_reduction

ark.analysis.dimensionality_reduction.plot_dim_reduced_data(component_one, component_two, fig_id, hue, cell_data, title, title_fontsize=24, palette='Spectral', alpha=0.3, legend_type='full', bbox_to_anchor=(1.05, 1), legend_loc=2, legend_borderaxespad=0.0, dpi=None, save_dir=None, save_file=None)

Helper function for visualize_dimensionality_reduction

Parameters:

component_one (pandas.Series) – the data corresponding to the first component
component_two (pandas.Series) – the data corresponding to the second component
fig_id (int) – the figure identifier for the visualization
hue (pandas.Series) – define the hue for each data point
cell_data (pandas.DataFrame) – Dataframe containing columns for dimensionality reduction and category
title (str) – the title we wish to set for the graph
title_fontsize (int) – the fontsize of the title we want
palette (str) – the color palette we wish to visualize with
alpha (float) – a value to define the opacity of the points visualized
legend_type (str) – what type of legend we wish to specify
bbox_to_anchor (tuple) – the bounding box of the legend
legend_loc (str | int) – a location string or integer location code describing where we want the legend located
legend_borderaxespad (float) – the pad between the axes and legend border
dpi (float) – The resolution of the image to save, ignored if save_dir is None
save_dir (str) – Directory to save plots, default is None
save_file (str) – If save_dir specified, specify a file name you wish to save to. Ignored if save_dir is None

ark.analysis.dimensionality_reduction.visualize_dimensionality_reduction(cell_data, columns, category, color_map='Spectral', algorithm='UMAP', dpi=None, save_dir=None)

Plots the dimensionality reduction of specified population columns

Parameters:

cell_data (pandas.DataFrame) – Dataframe containing columns for dimensionality reduction and category
columns (list) – List of column names that are included for dimensionality reduction
category (str) – Name of column in dataframe containing population or patient data
color_map (str) – Name of MatPlotLib ColorMap used
algorithm (str) – Name of dimensionality reduction algorithm, must be UMAP, PCA, or tSNE
dpi (float) – The resolution of the image to save, ignored if save_dir is None
save_dir (str) – Directory to save plots, default is None

ark.analysis.neighborhood_analysis

ark.analysis.neighborhood_analysis.compute_cell_ratios(neighbors_mat, target_cells, reference_cells, fov_list, bin_number=10, cell_col='cell_meta_cluster', fov_col='fov', label_col='label')

Computes the target/reference and reference/target ratios for each FOV

Parameters:

neighbors_mat (pandas.DataFrame) – a neighborhood matrix, created from create_neighborhood_matrix
target_cells (list) – invading cell phenotypes
reference_cells (list) – expected cell phenotypes
fov_list (list) – names of the fovs to compare
bin_number (int) – number of bins to use in histogram
cell_col (str) – column with the cell phenotype
fov_col (str) – column with the fovs
label_col (str) – column with the cell labels

Returns:

the target/reference ratios of each FOV
the reference/target ratios of each FOV

Return type:

tuple(list, list)
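
The two ratios can be illustrated with a small pandas sketch. It is a simplified stand-in, not the ark function: the name cell_ratios is hypothetical, the histogram controlled by bin_number is left out, and returning NaN when either population is missing from a FOV is an assumption.

```python
import numpy as np
import pandas as pd


def cell_ratios(cell_table, target_cells, reference_cells, fov_list,
                cell_col="cell_meta_cluster", fov_col="fov"):
    """Per-FOV target/reference and reference/target count ratios."""
    target_ref, ref_target = [], []
    for fov in fov_list:
        phenotypes = cell_table.loc[cell_table[fov_col] == fov, cell_col]
        n_target = int(phenotypes.isin(target_cells).sum())
        n_ref = int(phenotypes.isin(reference_cells).sum())
        # Assumption: the ratio is undefined when either population is absent
        if n_target == 0 or n_ref == 0:
            target_ref.append(np.nan)
            ref_target.append(np.nan)
        else:
            target_ref.append(n_target / n_ref)
            ref_target.append(n_ref / n_target)
    return target_ref, ref_target


table = pd.DataFrame({
    "fov": ["fov1"] * 4 + ["fov2"] * 2,
    "cell_meta_cluster": ["CD8T", "CD8T", "CD8T", "Tumor", "Tumor", "Tumor"],
})
t_over_r, r_over_t = cell_ratios(table, ["CD8T"], ["Tumor"], ["fov1", "fov2"])
```

Here fov1 holds three target cells and one reference cell, while fov2 has no target cells at all, so both of its ratios are NaN.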

ark.analysis.neighborhood_analysis.compute_cluster_metrics_inertia(neighbor_mat, min_k=2, max_k=10, seed=42, included_fovs=None, fov_col='fov', label_col='label', cell_col='cell_meta_cluster')

Produce k-means clustering metrics to help identify optimal number of clusters using: inertia

Parameters:

neighbor_mat (pandas.DataFrame) – a neighborhood matrix, created from create_neighborhood_matrix
min_k (int) – the minimum k we want to generate cluster statistics for, must be at least 2
max_k (int) – the maximum k we want to generate cluster statistics for, must be at least 2
seed (int) – the random seed to set for k-means clustering
included_fovs (list) – fovs to include in analysis. If None, all fovs are used.
fov_col (str) – the name of the column in neighbor_mat indicating the fov
label_col (str) – the name of the column in neighbor_mat indicating the label
cell_col (str) – column with the cell phenotype

Returns:

an xarray with dimensions (num_k_values), where num_k_values is the range of integers from min_k to max_k inclusive; contains the metric score for each value in num_k_values

Return type:

xarray.DataArray

ark.analysis.neighborhood_analysis.compute_cluster_metrics_silhouette(neighbor_mat, min_k=2, max_k=10, seed=42, included_fovs=None, fov_col='fov', label_col='label', cell_col='cell_meta_cluster', subsample=None)

Produce k-means clustering metrics to help identify optimal number of clusters using: Silhouette score

Parameters:

neighbor_mat (pandas.DataFrame) – a neighborhood matrix, created from create_neighborhood_matrix
min_k (int) – the minimum k we want to generate cluster statistics for, must be at least 2
max_k (int) – the maximum k we want to generate cluster statistics for, must be at least 2
seed (int) – the random seed to set for k-means clustering
included_fovs (list) – fovs to include in analysis. If None, all fovs are used.
fov_col (str) – the name of the column in neighbor_mat indicating the fov
label_col (str) – the name of the column in neighbor_mat indicating the label
cell_col (str) – column with the cell phenotype
subsample (int) – the number of cells that will be sampled from each neighborhood cluster for calculating Silhouette score. If None, all cells will be used

Returns:

an xarray with dimensions (num_k_values), where num_k_values is the range of integers from min_k to max_k inclusive; contains the metric score for each value in num_k_values

Return type:

xarray.DataArray

ark.analysis.neighborhood_analysis.compute_mixing_score(fov_neighbors_mat, target_cells, reference_cells, mixing_type, ratio_threshold=5, cell_count_thresh=200, cell_col='cell_meta_cluster', fov_col='fov', label_col='label')

Compute and return the mixing score for the specified target/reference cell types

Parameters:

fov_neighbors_mat (pandas.DataFrame) – a neighborhood matrix, created from create_neighborhood_matrix and subsetted for 1 fov
target_cells (list) – invading cell phenotypes
reference_cells (list) – expected cell phenotypes
mixing_type (str) – “homogeneous” or “percent”, homogeneous is a symmetrical calculation
ratio_threshold (int) – maximum ratio of cell_types required to calculate a mixing score, under this labeled “cold”
cell_count_thresh (int) – minimum number of total cells from both populations to calculate a mixing score, under this labeled “cold”
cell_col (str) – column with the cell phenotype
fov_col (str) – column with the fovs
label_col (str) – column with the cell labels

Returns:

the mixing score for the FOV

Return type:

float

ark.analysis.neighborhood_analysis.create_neighborhood_matrix(all_data, dist_mat_dir, included_fovs=None, distlim=50, self_neighbor=False, fov_col='fov', cell_label_col='label', cell_type_col='cell_meta_cluster')

Calculates the number of neighbor phenotypes for each cell.

Parameters:

all_data (pandas.DataFrame) – data for all fovs. Includes the columns for fov, label, and cell phenotype.
dist_mat_dir (str) – directory containing the distance matrices
included_fovs (list) – fovs to include in analysis. If None, all fovs are used.
distlim (int) – cell proximity threshold. Default is 50.
self_neighbor (bool) – If true, cell counts itself as a neighbor in the analysis. Default is False.
fov_col (str) – column with the cell fovs.
cell_label_col (str) – column with the cell labels.
cell_type_col (str) – column with the cell types.

Returns:

DataFrame containing phenotype counts per cell
DataFrame containing phenotype frequencies (counts per phenotype / total phenotypes) for each cell

Return type:

tuple (pandas.DataFrame, pandas.DataFrame)

ark.analysis.neighborhood_analysis.generate_cluster_matrix_results(all_data, neighbor_mat, cluster_num, seed=42, excluded_channels=None, included_fovs=None, cluster_label_col='kmeans_neighborhood', fov_col='fov', cell_type_col='cell_meta_cluster', label_col='label', pre_channel_col='cell_size', post_channel_col='label')

Generate the cluster info on all_data using k-means clustering on neighbor_mat.

cluster_num has to be picked based on visualizations from compute_cluster_metrics_inertia or compute_cluster_metrics_silhouette.

Parameters:

all_data (pandas.DataFrame) – data including fovs, cell labels, and cell expression matrix for all markers
neighbor_mat (pandas.DataFrame) – a neighborhood matrix, created from create_neighborhood_matrix
cluster_num (int) – the optimal k to pass into k-means clustering to generate the final clusters and corresponding results
seed (int) – the random seed to set for k-means clustering
excluded_channels (list) – all channel names to be excluded from analysis
included_fovs (list) – fovs to include in analysis. If argument is None, default is all fovs used
cluster_label_col (str) – the name of the cluster label col we will create for neighborhood clusters
fov_col (str) – the name of the column in all_data and neighbor_mat indicating the fov
cell_type_col (str) – the name of the column in all_data indicating the cell type
label_col (str) – the name of the column in all_data indicating cell label
pre_channel_col (str) – the name of the column in all_data right before the first channel column
post_channel_col (str) – the name of the column in all_data right after the last channel column

Returns:

the expression matrix with the corresponding cluster labels attached, will only include fovs included in the analysis
an a x b count matrix (a = # of clusters, b = # of cell types) with cluster ids indexed row-wise and cell types indexed column-wise, indicates the number of cells of each type within each cluster
an a x c mean matrix (a = # of clusters, c = # of markers) with cluster ids indexed row-wise and markers indexed column-wise, indicates the mean marker expression for each cluster id

Return type:

tuple (pandas.DataFrame, pandas.DataFrame, pandas.DataFrame)

ark.analysis.spatial_analysis_utils

ark.analysis.spatial_analysis_utils.append_distance_features_to_dataset(fov, dist_matrix, cell_table, distance_columns)

Appends selected distance features as ‘cells’ in distance matrix and cell table

Parameters:

fov (str) – the name of the FOV
dist_matrix (xarray.DataArray) – a cells x cells matrix with the Euclidean distance between centers of corresponding cells for the FOV
cell_table (pd.DataFrame) – Table of cell features. Must contain provided distance columns
distance_columns (List[str]) – List of column names which store feature distance. These must exist in cell_table

Returns:

Updated cell_table, and distance matrices indexed by fov name

Return type:

(pd.DataFrame, dict)

ark.analysis.spatial_analysis_utils.calc_dist_matrix(label_dir, save_path, prefix='_whole_cell')

Generate matrix of distances between center of pairs of cells.

Saves each one individually to save_path.

Parameters:

label_dir (str) – path to segmentation masks indexed by (fov, cell_id, cell_id, label)
save_path (str) – path to save the distance matrices
prefix (str) – the prefix used to identify label map files in label_dir
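
As a rough illustration of a cells x cells distance matrix, the NumPy sketch below derives cell centers from a labelled segmentation mask and computes the pairwise Euclidean distances between them. It is an in-memory example under stated assumptions: the name centroid_distance_matrix is hypothetical, label 0 is treated as background, a cell's center is taken to be the mean of its pixel coordinates, and no files are read or written.

```python
import numpy as np


def centroid_distance_matrix(label_mask):
    """Pairwise Euclidean distances between cell centroids of one mask."""
    labels = np.unique(label_mask)
    labels = labels[labels != 0]  # assumption: 0 marks background pixels
    # Centroid = mean (row, col) coordinate of the pixels with that label
    centroids = np.array(
        [np.argwhere(label_mask == lab).mean(axis=0) for lab in labels]
    )
    # Broadcast to all centroid pairs, then take the vector norm
    diff = centroids[:, None, :] - centroids[None, :, :]
    return labels, np.sqrt((diff ** 2).sum(axis=-1))


mask = np.zeros((5, 5), dtype=int)
mask[0, 0] = 1       # cell 1: centroid (0, 0)
mask[0, 3:5] = 2     # cell 2: centroid (0, 3.5)
mask[4, 0] = 3       # cell 3: centroid (4, 0)
cell_ids, dist_mat = centroid_distance_matrix(mask)
```

The result is symmetric with a zero diagonal, and its rows and columns follow the order of cell_ids.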

ark.analysis.spatial_analysis_utils.calculate_enrichment_stats(close_num, close_num_rand)

Calculates z score and p values from spatial enrichment analysis.

Parameters:

close_num (numpy.ndarray) – marker x marker matrix with counts for cells positive for corresponding markers
close_num_rand (numpy.ndarray) – random positive marker counts for every permutation in the bootstrap

Returns:

xarray containing the following statistics for marker to marker enrichment:

z: z scores for corresponding markers
muhat: predicted mean values of close_num_rand random distribution
sigmahat: predicted standard deviations of close_num_rand random distribution
p: p values for corresponding markers, for both positive and negative enrichment
h: matrix indicating whether corresponding marker interactions are significant
adj_p: fdr_bh adjusted p values

Return type:

xarray.DataArray
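
The relationship between the observed counts, the bootstrap distribution, and the z score can be sketched as follows. This is an illustrative reading, not the library code: the name enrichment_z_and_p is hypothetical, close_num_rand is assumed to have shape (markers, markers, bootstrap_num), muhat and sigmahat are taken as the plain mean and standard deviation over the bootstrap axis, and the empirical p value formula is an assumption. The h and adj_p entries are not covered.

```python
import numpy as np


def enrichment_z_and_p(close_num, close_num_rand):
    """z scores and empirical p values from a bootstrap null distribution."""
    # Assumption: the last axis of close_num_rand indexes the permutations
    muhat = close_num_rand.mean(axis=-1)
    sigmahat = close_num_rand.std(axis=-1)
    with np.errstate(divide="ignore", invalid="ignore"):
        z = (close_num - muhat) / sigmahat
    n_boot = close_num_rand.shape[-1]
    observed = close_num[..., None]
    # Assumption: add-one empirical p values for each enrichment direction
    p_pos = (1 + (close_num_rand >= observed).sum(axis=-1)) / (n_boot + 1)
    p_neg = (1 + (close_num_rand <= observed).sum(axis=-1)) / (n_boot + 1)
    return z, muhat, sigmahat, p_pos, p_neg


observed_counts = np.array([[10.0]])
random_counts = np.array([[[2.0, 4.0, 4.0, 6.0]]])  # four permutations
z, muhat, sigmahat, p_pos, p_neg = enrichment_z_and_p(observed_counts,
                                                     random_counts)
```

In this toy case the observed count exceeds every permuted count, so the z score is large and positive and the positive-enrichment p value takes its smallest possible value for four permutations.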

ark.analysis.spatial_analysis_utils.compute_close_cell_num(dist_mat, dist_lim, analysis_type, current_fov_data=None, current_fov_channel_data=None, cluster_ids=None, cell_types_analyze=None, thresh_vec=None, cell_label_col='label', cell_type_col='cell_num')

Finds positive cell labels and creates matrix with counts for cells positive for corresponding markers. Computes close_num matrix for both Cell Label and Threshold spatial analyses.

This function loops through all the included markers in the patient data and identifies cell labels positive for corresponding markers. It then subsets the distance matrix to only include these positive cells and records interactions based on whether cells are close to each other (within the dist_lim). It then stores the number of interactions in the index of close_num corresponding to both markers (for instance markers 1 and 2 would be in index [0, 1]).

Parameters:

dist_mat (numpy.ndarray) – cells x cells matrix with the Euclidean distance between centers of corresponding cells
dist_lim (int) – threshold for spatial enrichment distance proximity
analysis_type (str) – type of analysis, must be either cluster or channel
current_fov_data (pandas.DataFrame) – data for specific patient in expression matrix
current_fov_channel_data (pandas.DataFrame) – data of only column markers for Channel Analysis
cluster_ids (numpy.ndarray) – all the cell phenotypes in Cluster Analysis
cell_types_analyze (list) – a list of the cell types we wish to analyze, if None we set it equal to all cell types
thresh_vec (numpy.ndarray) – matrix of thresholds column for markers
cell_label_col (str) – the name of the column containing the cell labels
cell_type_col (str) – the name of the column containing the cell type numbers

Returns:

2D array containing marker x marker matrix with counts for cells positive for corresponding markers, as well as a list of number of cell labels for marker 1

Return type:

numpy.ndarray

ark.analysis.spatial_analysis_utils.compute_close_cell_num_random(marker_nums, mark_pos_labels, dist_mat, dist_lim, bootstrap_num)

Uses bootstrapping to permute cell labels randomly and records the number of close cells (within the dist_lim) in that random setup.

Parameters:

marker_nums (numpy.ndarray) – list of cell counts of each marker type
mark_pos_labels (list) – cell labels for each marker number
dist_mat (xr.DataArray) – cells x cells matrix with the Euclidean distance between centers of corresponding cells. This can be indexed by cell label
dist_lim (int) – threshold for spatial enrichment distance proximity
bootstrap_num (int) – number of permutations

Returns:

Large matrix of random positive marker counts for every permutation in the bootstrap

Return type:

numpy.ndarray
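
A permutation null of this kind can be sketched by repeatedly drawing random cells in place of the marker-positive ones and counting how many drawn pairs fall within the distance limit. The code below is a loose illustration only: the name random_close_counts is hypothetical, a plain NumPy array replaces the label-indexed xarray, and both the sampling scheme (without replacement) and what is counted (close pairs) are assumptions about the procedure.

```python
import numpy as np


def random_close_counts(marker_nums, dist_mat, dist_lim, bootstrap_num, seed=0):
    """Counts of close cell pairs under random label assignment."""
    rng = np.random.default_rng(seed)
    n_cells = dist_mat.shape[0]
    close = dist_mat < dist_lim  # boolean cells x cells proximity matrix
    n_markers = len(marker_nums)
    out = np.zeros((n_markers, n_markers, bootstrap_num), dtype=int)
    for b in range(bootstrap_num):
        for i in range(n_markers):
            # Draw as many random cells as are positive for marker i
            rows = rng.choice(n_cells, size=marker_nums[i], replace=False)
            for j in range(n_markers):
                cols = rng.choice(n_cells, size=marker_nums[j], replace=False)
                # Assumption: the statistic is the number of close pairs
                out[i, j, b] = close[np.ix_(rows, cols)].sum()
    return out


# Six cells that are all within reach of one another
all_close = np.zeros((6, 6))
null_counts = random_close_counts([2, 3], all_close, dist_lim=1,
                                  bootstrap_num=5)
```

Because every pair of cells is close in this toy input, each permutation yields exactly marker_nums[i] * marker_nums[j] close pairs, which makes the output easy to check by hand.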

ark.analysis.spatial_analysis_utils.compute_kmeans_inertia(neighbor_mat_data, min_k=2, max_k=10, seed=42)

For a given neighborhood matrix, cluster and compute inertia using k-means clustering for each k from min_k to max_k

Parameters:

neighbor_mat_data (pandas.DataFrame) – neighborhood matrix data with only the desired fovs
min_k (int) – the minimum k we want to generate cluster statistics for, must be at least 2
max_k (int) – the maximum k we want to generate cluster statistics for, must be at least 2
seed (int) – the random seed to set for k-means clustering

Returns:

contains a single dimension, cluster_num, which indicates the inertia when cluster_num was set as k for k-means clustering

Return type:

xarray.DataArray

ark.analysis.spatial_analysis_utils.compute_kmeans_silhouette(neighbor_mat_data, min_k=2, max_k=10, seed=42, subsample=None)

For a given neighborhood matrix, cluster and compute Silhouette score using k-means clustering for each k from min_k to max_k

Parameters:

neighbor_mat_data (pandas.DataFrame) – neighborhood matrix data with only the desired fovs
min_k (int) – the minimum k we want to generate cluster statistics for, must be at least 2
max_k (int) – the maximum k we want to generate cluster statistics for, must be at least 2
seed (int) – the random seed to set for k-means clustering
subsample (int) – the number of cells that will be sampled from each neighborhood cluster for calculating Silhouette score. If None, all cells will be used

Returns:

contains a single dimension, cluster_num, which indicates the Silhouette score when cluster_num was set as k for k-means clustering

Return type:

xarray.DataArray

ark.analysis.spatial_analysis_utils.compute_neighbor_counts(current_fov_neighborhood_data, dist_matrix, distlim, self_neighbor=False, cell_label_col='label', cluster_name_col='cell_meta_cluster')

Calculates the number of neighbor phenotypes for each cell. The cell counts itself as a neighbor if self_neighbor=True.

Parameters:

current_fov_neighborhood_data (pandas.DataFrame) – data for the current fov, including the cell labels and cell phenotypes
dist_matrix (numpy.ndarray) – cells x cells matrix with the Euclidean distance between centers of corresponding cells
distlim (int) – threshold for distance proximity
self_neighbor (bool) – If true, cell counts itself as a neighbor in the analysis.
cell_label_col (str) – Column name with the cell labels
cluster_name_col (str) – Column name with the cell types

Returns:

phenotype counts per cell
phenotype frequencies of counts per total for each cell

Return type:

tuple (pandas.DataFrame, pandas.DataFrame)
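
Counting neighbors by phenotype amounts to thresholding the distance matrix and multiplying it by a one-hot encoding of the cell types. The sketch below illustrates that idea under stated assumptions and is not the ark implementation: the name neighbor_counts is hypothetical, rows of dist_matrix are assumed to follow the row order of the table, "close" is taken as strictly less than distlim, and cells without any neighbor get NaN frequencies.

```python
import numpy as np
import pandas as pd


def neighbor_counts(fov_data, dist_matrix, distlim, self_neighbor=False,
                    cell_label_col="label",
                    cluster_name_col="cell_meta_cluster"):
    """Per-cell neighbor counts and frequencies for each phenotype."""
    # Assumption: neighbors are cells strictly closer than distlim
    close = (np.asarray(dist_matrix) < distlim).astype(int)
    if not self_neighbor:
        np.fill_diagonal(close, 0)  # a cell is not its own neighbor
    # One column per phenotype, 1 where the cell has that phenotype
    one_hot = pd.get_dummies(fov_data[cluster_name_col], dtype=int)
    counts = pd.DataFrame(close @ one_hot.to_numpy(),
                          index=fov_data[cell_label_col].to_numpy(),
                          columns=one_hot.columns)
    totals = counts.sum(axis=1)
    # Assumption: cells with no neighbors get NaN instead of dividing by 0
    freqs = counts.div(totals.replace(0, np.nan), axis=0)
    return counts, freqs


positions = np.array([0.0, 10.0, 100.0])
fov_cells = pd.DataFrame({"label": [1, 2, 3],
                          "cell_meta_cluster": ["A", "B", "B"]})
distances = np.abs(positions[:, None] - positions[None, :])
counts, freqs = neighbor_counts(fov_cells, distances, distlim=50)
```

Cells 1 and 2 are each other's only neighbor, while cell 3 lies beyond the limit and therefore has zero counts and undefined frequencies.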

ark.analysis.spatial_analysis_utils.generate_cluster_labels(neighbor_mat_data, cluster_num, seed=42)

Run k-means clustering with k=cluster_num

Given the same data, repeated runs will always produce the same clusters, but the labels assigned to them will likely differ

Parameters:

neighbor_mat_data (pandas.DataFrame) – neighborhood matrix data with only the desired fovs
cluster_num (int) – the k we want to use when running k-means clustering
seed (int) – the random seed to set for k-means clustering

Returns:

the neighborhood cluster labels assigned to each cell in neighbor_mat_data

Return type:

numpy.ndarray

ark.analysis.spatial_analysis_utils.get_pos_cell_labels_channel(thresh, current_fov_channel_data, cell_labels, current_marker)

For channel enrichment, identifies cells with positive expression values for the current marker (greater than the marker threshold).

Parameters:

thresh (int) – current threshold for marker
current_fov_channel_data (pandas.DataFrame) – expression data for column markers for current patient
cell_labels (pandas.DataFrame) – the column of cell labels for current patient
current_marker (str) – the current marker that the positive labels are being found for

Returns:

List of all the positive labels

Return type:

list
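
Selecting marker-positive cells is a one-line boolean filter in pandas. The sketch below is a minimal illustration with a hypothetical function name (positive_labels) and made-up marker values; it assumes the expression table and the label column share the same index.

```python
import pandas as pd


def positive_labels(thresh, channel_data, cell_labels, marker):
    """Labels of cells whose marker expression exceeds the threshold."""
    # Strictly greater than the threshold counts as positive
    is_positive = channel_data[marker] > thresh
    return cell_labels[is_positive].tolist()


expression = pd.DataFrame({"CD3": [0.1, 0.9, 0.5, 0.7]})  # made-up values
label_col = pd.Series([11, 12, 13, 14])
cd3_positive = positive_labels(0.5, expression, label_col, "CD3")
```

The cell whose expression equals the threshold exactly (label 13) is not counted as positive.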

ark.analysis.spatial_analysis_utils.get_pos_cell_labels_cluster(pheno, current_fov_neighborhood_data, cell_label_col, cell_type_col)

For cluster enrichment, finds positive labels that match the current phenotype.

Parameters:

pheno (str) – the current cell phenotype
current_fov_neighborhood_data (pandas.DataFrame) – data for the current patient
cell_label_col (str) – the name of the column indicating the cell label
cell_type_col (str) – the name of the column indicating the cell type

Returns:

List of all the positive labels

Return type:

list

ark.analysis.spatial_enrichment

ark.analysis.spatial_enrichment.calculate_channel_spatial_enrichment(fov, dist_matrix, marker_thresholds, all_data, excluded_channels=None, dist_lim=100, bootstrap_num=100, fov_col='fov', cell_label_col='label', context_col=None)

Spatial enrichment analysis to find significant interactions between cells expressing different markers. Uses bootstrapping to permute cell labels randomly.

Parameters:

fov (str) – the name of the FOV
dist_matrix (xarray.DataArray) – a cells x cells matrix with the Euclidean distance between centers of corresponding cells for the FOV
marker_thresholds (pd.DataFrame) – threshold values for positive marker expression
all_data (pandas.DataFrame) – data including fovs, cell labels, and cell expression matrix for all markers
excluded_channels (list) – channels to be excluded from the analysis. Default is None.
dist_lim (int) – cell proximity threshold. Default is 100.
bootstrap_num (int) – number of permutations for bootstrap. Default is 100.
fov_col (str) – column with the cell fovs.
cell_label_col (str) – cell label column name.
context_col (str) – column with context label.

Returns:

a tuple of closenum and closenumrand for the fov computed in the analysis
an xarray with dimensions (fovs, stats, num_channels, num_channels). The included stats variables for each fov are z, muhat, sigmahat, p, h, adj_p, and cluster_names

Return type:

tuple (tuple, xarray.DataArray)

ark.analysis.spatial_enrichment.calculate_cluster_spatial_enrichment(fov, all_data, dist_matrix, included_fovs=None, bootstrap_num=100, dist_lim=100, fov_col='fov', cluster_name_col='cell_meta_cluster', cluster_id_col='cell_num', cell_label_col='label', context_col=None, distance_cols=None)

Spatial enrichment analysis based on cell phenotypes to find significant interactions between different cell types, looking for both positive and negative enrichment. Uses bootstrapping to permute cell labels randomly.

Parameters:

fov (str) – the name of the FOV
all_data (pandas.DataFrame) – data including fovs, cell labels, and cell expression matrix for all markers
dist_matrix (xarray.DataArray) – a cells x cells matrix with the Euclidean distance between centers of corresponding cells for the FOV
included_fovs (list) – patient labels to include in analysis. If None, all labels are used
bootstrap_num (int) – number of permutations for bootstrap. Default is 100
dist_lim (int) – cell proximity threshold. Default is 100
fov_col (str) – column with the cell fovs.
cluster_name_col (str) – column with the cell types.
cluster_id_col (str) – column with the cell phenotype number.
cell_label_col (str) – column with the cell labels.
context_col (str) – column with context labels. If None, no context is assumed.
distance_cols (list) – column names of feature distances to include in analysis.

Returns:

a tuple of closenum and closenumrand for the fov computed in the analysis
an xarray with dimensions (fovs, stats, number of channels, number of channels). The included stats variables for each fov are: z, muhat, sigmahat, p, h, adj_p, and cluster_names

Return type:

tuple (tuple, xarray.DataArray)

ark.analysis.spatial_enrichment.generate_channel_spatial_enrichment_stats(label_dir, dist_mat_dir, marker_thresholds, all_data, suffix='_whole_cell', xr_channel_name='label', **kwargs)

Wrapper function for batching calls to calculate_channel_spatial_enrichment over fovs

Parameters:

label_dir (str | Pathlike) – directory containing labeled tiffs
dist_mat_dir (str | Pathlike) – directory containing the distance matrices
marker_thresholds (pd.DataFrame) – threshold values for positive marker expression
all_data (pandas.DataFrame) – data including fovs, cell labels, and cell expression matrix for all markers
suffix (str) – suffix for tiff file names
xr_channel_name (str) – channel name for label data array
**kwargs (dict) – args passed to calculate_channel_spatial_enrichment

Returns:

a list with each element consisting of a tuple of closenum and closenumrand for each fov included in the analysis
an xarray with dimensions (fovs, stats, num_channels, num_channels). The included stats variables for each fov are z, muhat, sigmahat, p, h, adj_p, and cluster_names

Return type:

tuple (list, xarray.DataArray)

ark.analysis.spatial_enrichment.generate_cluster_spatial_enrichment_stats(label_dir, dist_mat_dir, all_data, suffix='_whole_cell', xr_channel_name='label', **kwargs)

Wrapper function for batching calls to calculate_cluster_spatial_enrichment over fovs

Parameters:

label_dir (str | Pathlike) – directory containing labeled tiffs
dist_mat_dir (str | Pathlike) – directory containing the distance matrices
all_data (pandas.DataFrame) – data including fovs, cell labels, and cell expression matrix for all markers
suffix (str) – suffix for tiff file names
xr_channel_name (str) – channel name for label data array
**kwargs (dict) – args passed to calculate_cluster_spatial_enrichment

Returns:

a list with each element consisting of a tuple of closenum and closenumrand for each fov included in the analysis
an xarray with dimensions (fovs, stats, num_channels, num_channels). The included stats variables for each fov are z, muhat, sigmahat, p, h, adj_p, and cluster_names

Return type:

tuple (list, xarray.DataArray)

ark.analysis.visualize

ark.analysis.visualize.draw_boxplot(cell_data, col_name, col_split=None, split_vals=None, dpi=None, save_dir=None, save_file=None)

Draws a boxplot for a given column, optionally with help from a split column

Parameters:

cell_data (pandas.DataFrame) – Dataframe containing columns with Patient ID and Cell Name
col_name (str) – Name of the column we wish to draw a box-and-whisker plot for
col_split (str) – If specified, used for additional box-and-whisker plot faceting
split_vals (list) – If specified, only visualize the specified values in the col_split column
dpi (float) – The resolution of the image to save, ignored if save_dir is None
save_dir (str) – If specified, a directory where we will save the plot
save_file (str) – If save_dir specified, specify a file name you wish to save to. Ignored if save_dir is None

ark.analysis.visualize.draw_heatmap(data, x_labels, y_labels, dpi=None, center_val=None, min_val=None, max_val=None, cbar_ticks=None, colormap='vlag', row_colors=None, row_cluster=True, col_colors=None, col_cluster=True, left_start=None, right_start=None, w_spacing=None, h_spacing=None, save_dir=None, save_file=None)

Plots the z scores between all phenotypes as a clustermap.

Parameters:

data (numpy.ndarray) – The data array to visualize
x_labels (list) – List of names displayed on horizontal axis
y_labels (list) – List of all names displayed on vertical axis
dpi (float) – The resolution of the image to save, ignored if save_dir is None
center_val (float) – value at which to center the heatmap
min_val (float) – minimum value the heatmap should take
max_val (float) – maximum value the heatmap should take
cbar_ticks (list) – list of values containing tick labels for the heatmap colorbar
colormap (str) – color scheme for visualization
row_colors (list) – Include these values as an additional color-coded cluster bar for row values
row_cluster (bool) – Whether to include dendrogram clustering for the rows
col_colors (list) – Include these values as an additional color-coded cluster bar for column values
col_cluster (bool) – Whether to include dendrogram clustering for the columns
left_start (float) – The position to set the left edge of the figure to (from 0-1)
right_start (float) – The position to set the right edge of the figure to (from 0-1)
w_spacing (float) – The amount of spacing to put between the subplots width-wise (from 0-1)
h_spacing (float) – The amount of spacing to put between the subplots height-wise (from 0-1)
save_dir (str) – If specified, a directory where we will save the plot
save_file (str) – If save_dir specified, specify a file name you wish to save to. Ignored if save_dir is None

ark.analysis.visualize.get_sorted_data(cell_data, sort_by_first, sort_by_second, is_normalized=False)

Gets the cell data and generates a new Sorted DataFrame with each row representing a patient and column representing Population categories

Parameters:

cell_data (pandas.DataFrame) – Dataframe containing columns with Patient ID and Cell Name
sort_by_first (str) – The first attribute we will be sorting our data by
sort_by_second (str) – The second attribute we will be sorting our data by
is_normalized (bool) – Boolean specifying whether to normalize cell counts or not, default is False

Returns:

DataFrame with rows and columns sorted by population

Return type:

pandas.DataFrame
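
To make the shape of the result concrete, here is a minimal sketch of a patient-by-population table built with pandas. It is an approximation under stated assumptions, not ark's implementation. The function name sorted_population_table, the column names, and the choice to order rows and columns by descending total count are all illustrative.

```python
import pandas as pd


def sorted_population_table(cell_data, row_col, pop_col, normalize=False):
    """Cross-tabulate cells into a row_col by pop_col count table."""
    counts = pd.crosstab(cell_data[row_col], cell_data[pop_col])

    # Assumed ordering: largest totals first, for both rows and columns.
    row_order = counts.sum(axis=1).sort_values(ascending=False).index
    col_order = counts.sum(axis=0).sort_values(ascending=False).index
    counts = counts.loc[row_order, col_order]

    if normalize:
        # Scale each row so that its entries sum to 1.
        counts = counts.div(counts.sum(axis=1), axis=0)
    return counts


# Hypothetical cell table: one row per cell.
cell_data = pd.DataFrame({
    "patient_id": ["P1", "P1", "P1", "P1", "P2", "P2", "P2", "P2", "P2"],
    "cell_type": ["Tumor", "Tumor", "Tumor", "CD4T",
                  "Tumor", "CD4T", "CD4T", "CD8T", "Tumor"],
})

table = sorted_population_table(cell_data, "patient_id", "cell_type")
normalized = sorted_population_table(
    cell_data, "patient_id", "cell_type", normalize=True
)
```

With normalize=True each row holds proportions rather than raw counts, which corresponds to the is_normalized flag described above.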

ark.analysis.visualize.plot_barchart(data, title, x_label, y_label, color_map='jet', is_stacked=True, is_legend=True, legend_loc='center left', bbox_to_anchor=(1.0, 0.5), dpi=None, save_dir=None, save_file=None)

A helper function to visualize_patient_population_distribution

Parameters:

data (pandas.DataFrame) – The data we wish to visualize
title (str) – The title of the graph
x_label (str) – The label on the x-axis
y_label (str) – The label on the y-axis
color_map (str) – The name of the Matplotlib colormap used
is_stacked (bool) – Whether we want a stacked barchart or not
is_legend (bool) – Whether we want a legend or not
legend_loc (str) – If is_legend is set, specify where we want the legend to be. Ignored if is_legend is False
bbox_to_anchor (tuple) – If is_legend is set, specify the bounding box of the legend. Ignored if is_legend is False
dpi (float) – The resolution of the image to save, ignored if save_dir is None
save_dir (str) – Directory to save plots, default is None
save_file (str) – If save_dir specified, specify a file name you wish to save to. Ignored if save_dir is None
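
The defaults above (a stacked chart, the jet colormap, and a legend anchored to the right of the axes) can be reproduced with plain pandas and Matplotlib. The following is a sketch of that kind of figure using made-up counts. It does not call plot_barchart and is not meant to mirror its internals.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical per-patient counts: rows are patients, columns are populations.
counts = pd.DataFrame(
    {"Tumor": [3, 2], "CD4T": [1, 2], "CD8T": [0, 1]},
    index=["P1", "P2"],
)

fig, ax = plt.subplots()

# stacked=True corresponds to is_stacked; colormap corresponds to color_map.
counts.plot.bar(stacked=True, colormap="jet", ax=ax)

ax.set_title("Cell populations per patient")
ax.set_xlabel("Patient")
ax.set_ylabel("Cell count")

# Same placement as the documented legend_loc and bbox_to_anchor defaults.
ax.legend(loc="center left", bbox_to_anchor=(1.0, 0.5))

# To save the figure, something like the following could be used:
# fig.savefig("barchart.png", dpi=300, bbox_inches="tight")
```

Anchoring the legend at (1.0, 0.5) with loc="center left" places it just outside the right edge of the axes, so it does not cover the bars.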

ark.analysis.visualize.visualize_fov_graphs(cell_table, features, diff_mats, fovs, dpi=None, save_dir=None)

Visualize the adjacency graph used to define neighboring environments in each field of view.

Parameters:

cell_table (dict) – A formatted cell table for use in spatial-LDA analysis. Specifically, this is the output from format_cell_table().
features (dict) – A featurized cell table. Specifically, this is the output from featurize_cell_table().
diff_mats (dict) – The difference matrices produced by create_difference_matrices().
fovs (list) – A list of field of view IDs to plot.
dpi (float) – The resolution of the image to save, ignored if save_dir is None.
save_dir (str) – Directory to save plots, default is None

ark.analysis.visualize.visualize_fov_stats(data, metric='cellular_density', dpi=None, save_dir=None)

Visualize area and cell count distributions for all fields of view.

Parameters:

data (dict) – The dictionary of field of view metrics produced by fov_density().
metric (str) – One of “cellular_density”, “average_area”, or “total_cells”. See documentation of fov_density() for details.
dpi (float) – The resolution of the image to save, ignored if save_dir is None
save_dir (str) – Directory to save plots, default is None

ark.analysis.visualize.visualize_neighbor_cluster_metrics(neighbor_cluster_stats, metric_name, dpi=None, save_dir=None)

Visualize the cluster performance results of a neighborhood matrix

Parameters:

neighbor_cluster_stats (xarray.DataArray) – contains the statistic to visualize; should have one coordinate called cluster_num, labeled starting from 2
metric_name (str) – name of metric
dpi (float) – The resolution of the image to save, ignored if save_dir is None
save_dir (str) – Directory to save plots, default is None

ark.analysis.visualize.visualize_patient_population_distribution(cell_data, patient_col_name, population_col_name, color_map='jet', show_total_count=True, show_distribution=True, show_proportion=True, dpi=None, save_dir=None)

Plots the distribution of the population given by total count, direct count, and proportion

Parameters:

cell_data (pandas.DataFrame) – Dataframe containing columns with Patient ID and Cell Name
patient_col_name (str) – Name of column containing categorical Patient data
population_col_name (str) – Name of column in dataframe containing Population data
color_map (str) – Name of MatPlotLib ColorMap used. Default is jet
show_total_count (bool) – Boolean specifying whether to show graph of total population count, default is true
show_distribution (bool) – Boolean specifying whether to show graph of population distribution, default is true
show_proportion (bool) – Boolean specifying whether to show graph of population proportions, default is true
dpi (float) – The resolution of the image to save, ignored if save_dir is None
save_dir (str) – Directory to save plots, default is None
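
The three show_* flags presumably correspond to three summaries of the same cell table. As a hedged sketch, the quantities behind such plots can be computed with pandas as follows. The column names and data are hypothetical, and the mapping of each table to a flag is an assumption rather than a description of ark's code.

```python
import pandas as pd

# Hypothetical cell table: one row per cell.
cell_data = pd.DataFrame({
    "patient_id": ["P1", "P1", "P1", "P2", "P2"],
    "cell_type": ["Tumor", "Tumor", "CD4T", "Tumor", "CD4T"],
})

# Total count: number of cells per population across all patients.
total_count = cell_data["cell_type"].value_counts()

# Distribution: number of cells per population within each patient.
distribution = pd.crosstab(cell_data["patient_id"], cell_data["cell_type"])

# Proportion: the same table with each patient's row scaled to sum to 1.
proportion = pd.crosstab(
    cell_data["patient_id"], cell_data["cell_type"], normalize="index"
)
```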

ark.analysis.visualize.visualize_topic_eda(data, metric='gap_stat', gap_sd=True, k=None, transpose=False, scale=0.5, dpi=None, save_dir=None)

Visualize the exploratory metrics for spatial-LDA topics

Parameters:

data (dict) – The dictionary of exploratory metrics produced by compute_topic_eda().
metric (str) – One of “gap_stat”, “inertia”, “silhouette”, or “cell_counts”.
gap_sd (bool) – If True, the standard error of the gap statistic is included in the plot.
k (int) – References a specific KMeans clustering with k clusters for visualizing the cell count heatmap.
transpose (bool) – Swap axes for cell_counts heatmap
scale (float) – Plot size scaling for cell_counts heatmap
dpi (float) – The resolution of the image to save, ignored if save_dir is None
save_dir (str) – Directory to save plots, default is None