ark.phenotyping¶

ark.phenotyping.cell_cluster_utils¶

ark.phenotyping.cell_cluster_utils.add_consensus_labels_cell_table(base_dir, cell_table_path, cell_som_input_data)[source]¶

Adds the consensus cluster labels to the cell table, then resaves data to {cell_table_path}_cell_labels.csv

Parameters:

base_dir (str) – The path to the data directory
cell_table_path (str) – Path of the cell table, needs to be created with Segment_Image_Data.ipynb
cell_som_input_data (pandas.DataFrame) – The input data used for SOM training

ark.phenotyping.cell_cluster_utils.compute_cell_som_cluster_cols_avg(cell_cluster_data, cell_som_cluster_cols, cell_cluster_col, keep_count=False)[source]¶

For each cell SOM cluster, compute the average expression of all cell_som_cluster_cols

Parameters:

cell_cluster_data (pandas.DataFrame) – The cell data with SOM and/or meta labels, created by cluster_cells or cell_consensus_cluster
cell_som_cluster_cols (list) – The list of columns used for SOM training
cell_cluster_col (str) – Name of the cell cluster column to group by, should be 'cell_som_cluster' or 'cell_meta_cluster'
keep_count (bool) – Whether to include the cell counts or not, should only be set to True for visualization support

Returns:

Contains the average values for each column across cell SOM clusters

Return type:

pandas.DataFrame
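The grouping step described above can be pictured with a small pandas sketch. This is an illustration under stated assumptions, not ark's implementation: the helper name `cluster_cols_avg`, the toy column names, and the `count` column name are all hypothetical.

```python
import pandas as pd


def cluster_cols_avg(cell_cluster_data, cluster_cols, cluster_col, keep_count=False):
    """Average each column in cluster_cols within every cluster_col group."""
    grouped = cell_cluster_data.groupby(cluster_col)
    avg = grouped[cluster_cols].mean().reset_index()
    if keep_count:
        # Number of cells assigned to each cluster ("count" is an assumed name)
        counts = grouped.size().rename("count").reset_index()
        avg = avg.merge(counts, on=cluster_col)
    return avg


# Toy data: five cells spread over two SOM clusters
cells = pd.DataFrame({
    "cell_som_cluster": [1, 1, 2, 2, 2],
    "cluster_a": [0.2, 0.4, 1.0, 0.0, 0.5],
    "cluster_b": [0.8, 0.6, 0.0, 1.0, 0.5],
})
avg = cluster_cols_avg(cells, ["cluster_a", "cluster_b"], "cell_som_cluster", keep_count=True)
```

Passing 'cell_meta_cluster' as the grouping column would give the per-meta-cluster averages in the same way.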

ark.phenotyping.cell_cluster_utils.create_c2pc_data(fovs, pixel_data_path, cell_table_path, pixel_cluster_col='pixel_meta_cluster_rename')[source]¶

Create a matrix with each fov-cell label pair and its pixel SOM/meta cluster counts

Parameters:

fovs (list) – The list of fovs to subset on
pixel_data_path (str) – Path to directory with the pixel data with SOM and meta labels attached. Created by pixel_consensus_cluster.
cell_table_path (str) – Path to the cell table, needs to be created with Segment_Image_Data.ipynb
pixel_cluster_col (str) – The name of the pixel cluster column to count per cell. Should be 'pixel_som_cluster' or 'pixel_meta_cluster_rename'

Returns:

pandas.DataFrame: cell x cluster matrix with the count of each pixel SOM/meta cluster in each cell
pandas.DataFrame: same as above, but normalized by cell_size

Return type:

tuple
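As a rough sketch of the two returned tables, the counting and cell-size normalization could look like the following. The helper name and the in-memory inputs are hypothetical (the documented function reads from `pixel_data_path` and `cell_table_path`), and the `fov`, `label`, and `cell_size` column names are assumed.

```python
import pandas as pd


def pixel_cluster_counts_per_cell(pixel_data, cell_table, pixel_cluster_col):
    """Return (raw counts, counts divided by cell_size) per fov-label pair."""
    counts = (
        pixel_data.groupby(["fov", "label", pixel_cluster_col])
        .size()
        .unstack(fill_value=0)  # one column per pixel cluster
        .reset_index()
    )
    counts.columns.name = None
    cluster_cols = [c for c in counts.columns if c not in ("fov", "label")]
    merged = cell_table[["fov", "label", "cell_size"]].merge(counts, on=["fov", "label"])
    normalized = merged.copy()
    normalized[cluster_cols] = merged[cluster_cols].div(merged["cell_size"], axis=0)
    return merged, normalized


# Toy data: one FOV, two cells, five labeled pixels
pixels = pd.DataFrame({
    "fov": ["fov0"] * 5,
    "label": [1, 1, 1, 2, 2],
    "pixel_meta_cluster_rename": ["A", "A", "B", "B", "B"],
})
cell_table = pd.DataFrame({"fov": ["fov0", "fov0"], "label": [1, 2], "cell_size": [3, 2]})
raw_counts, norm_counts = pixel_cluster_counts_per_cell(
    pixels, cell_table, "pixel_meta_cluster_rename"
)
```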

ark.phenotyping.cell_meta_clustering¶

ark.phenotyping.cell_meta_clustering.apply_cell_meta_cluster_remapping(base_dir, cell_som_input_data, cell_remapped_name)[source]¶

Apply the meta cluster remapping to the data in cell_consensus_name. Resave the re-mapped consensus data to cell_consensus_name.

Parameters:

base_dir (str) – The path to the data directory
cell_som_input_data (pandas.DataFrame) – The input data used for SOM training
cell_remapped_name (str) – Name of the file containing the cell SOM clusters to their remapped meta clusters

Returns:

The input data used for SOM training with renamed meta labels attached

Return type:

pandas.DataFrame

ark.phenotyping.cell_meta_clustering.cell_consensus_cluster(base_dir, cell_som_cluster_cols, cell_som_input_data, cell_som_expr_col_avg_name, max_k=20, cap=3, seed=42, overwrite=False)[source]¶

Run consensus clustering algorithm on cell-level data averaged across each cell SOM cluster.

Saves data with consensus cluster labels to cell_consensus_name.

Parameters:

base_dir (str) – The path to the data directory
cell_som_cluster_cols (list) – The list of columns used for SOM training
cell_som_input_data (pandas.DataFrame) – The data used for SOM training with SOM labels attached
cell_som_expr_col_avg_name (str) – The name of the file with the average expression per column across cell SOM clusters. Used to run consensus clustering on.
max_k (int) – The number of consensus clusters
cap (int) – The z-score cap to use for hierarchical clustering
seed (int) – The random seed to set for consensus clustering
overwrite (bool) – If set, overwrites the meta cluster assignments if they exist

Returns:

cluster_helpers.PixieConsensusCluster: the consensus cluster object containing the SOM to meta mapping
pandas.DataFrame: the input data used for SOM training with meta labels attached

Return type:

tuple

ark.phenotyping.cell_meta_clustering.generate_meta_avg_files(base_dir, cell_cc, cell_som_cluster_cols, cell_som_input_data, cell_som_expr_col_avg_name, cell_meta_expr_col_avg_name, overwrite=False)[source]¶

Computes and saves the average cluster column expression across cell meta clusters. Assigns meta cluster labels to the data stored in cell_som_expr_col_avg_name.

Parameters:

base_dir (str) – The path to the data directory
cell_cc (cluster_helpers.PixieConsensusCluster) – The consensus cluster object containing the SOM to meta mapping
cell_som_cluster_cols (list) – The list of columns used for SOM training
cell_som_input_data (pandas.DataFrame) – The input data used for SOM training. Will have meta labels appended after this process is run.
cell_som_expr_col_avg_name (str) – The average values of cell_som_cluster_cols per cell SOM cluster. Used to run consensus clustering on.
cell_meta_expr_col_avg_name (str) – Same as above except for cell meta clusters
overwrite (bool) – If set, regenerate the averages of cell_som_cluster_cols per meta cluster

ark.phenotyping.cell_meta_clustering.generate_remap_avg_count_files(base_dir, cell_som_input_data, cell_remapped_name, cell_som_cluster_cols, cell_som_expr_col_avg_name, cell_meta_expr_col_avg_name)[source]¶

Apply the cell cluster remapping to the average count files

Parameters:

base_dir (str) – The path to the data directory
cell_som_input_data (pandas.DataFrame) – The input data used for SOM training
cell_remapped_name (str) – Name of the file containing the cell SOM clusters to their remapped meta clusters
cell_som_cluster_cols (list) – The list of columns used for SOM training
cell_som_expr_col_avg_name (str) – The average values of cell_som_cluster_cols per cell SOM cluster
cell_meta_expr_col_avg_name (str) – Same as above except for cell meta clusters

ark.phenotyping.cell_som_clustering¶

ark.phenotyping.cell_som_clustering.cluster_cells(base_dir, cell_pysom, cell_som_cluster_cols, num_parallel_cells=1000000, overwrite=False)[source]¶

Uses trained SOM weights to assign cluster labels on full cell data.

Saves data with cluster labels to cell_cluster_name.

Parameters:

base_dir (str) – The path to the data directory
cell_pysom (cluster_helpers.CellSOMCluster) – The SOM cluster object containing the cell SOM weights
cell_som_cluster_cols (list) – The list of columns used for SOM training
num_parallel_cells (int) – How many cells to label in parallel at once
overwrite (bool) – If set, overwrites the SOM cluster assignments if they exist

Returns:

The cell data in cell_pysom.cell_data with SOM labels assigned

Return type:

pandas.DataFrame

ark.phenotyping.cell_som_clustering.generate_som_avg_files(base_dir, cell_som_input_data, cell_som_cluster_cols, cell_som_expr_col_avg_name, overwrite=False)[source]¶

Computes and saves the average expression of all cell_som_cluster_cols across cell SOM clusters.

Parameters:

base_dir (str) – The path to the data directory
cell_som_input_data (pandas.DataFrame) – The input data used for SOM training with SOM labels attached
cell_som_cluster_cols (list) – The list of columns used for SOM training
cell_som_expr_col_avg_name (str) – The name of the file to write the average expression per column across cell SOM clusters
overwrite (bool) – If set, regenerate the averages of cell_som_cluster_cols for SOM clusters

ark.phenotyping.cell_som_clustering.train_cell_som(fovs, base_dir, cell_table_path, cell_som_cluster_cols, cell_som_input_data, som_weights_name='cell_som_weights.feather', xdim=10, ydim=10, lr_start=0.05, lr_end=0.01, num_passes=1, seed=42, overwrite=False, normalize=True)[source]¶

Run the SOM training on the expression columns specified in cell_som_cluster_cols.

Saves the SOM weights to base_dir/som_weights_name.

Parameters:

fovs (list) – The list of fovs to subset on
base_dir (str) – The path to the data directories
cell_table_path (str) – Path of the cell table, needs to be created with Segment_Image_Data.ipynb
cell_som_cluster_cols (list) – The list of columns in cell_som_input_data to use for SOM training
cell_som_input_data (pandas.DataFrame) – The input data to use for SOM training
som_weights_name (str) – The name of the file to save the SOM weights to
xdim (int) – The number of x nodes to use for the SOM
ydim (int) – The number of y nodes to use for the SOM
lr_start (float) – The start learning rate for the SOM, decays to lr_end
lr_end (float) – The end learning rate for the SOM, decays from lr_start
num_passes (int) – The number of training passes to make through the dataset
seed (int) – The random seed to use for training the SOM
overwrite (bool) – If set, force retrains the SOM and overwrites the weights
normalize (bool) – Whether to perform 99.9th percentile normalization. Defaults to True.

Returns:

The SOM cluster object containing the cell SOM weights

Return type:

cluster_helpers.CellSOMCluster

ark.phenotyping.cluster_helpers¶

class ark.phenotyping.cluster_helpers.CellSOMCluster(cell_data: pandas.DataFrame, weights_path: Path, fovs: List[str], columns: List[str], num_passes: int = 1, xdim: int = 10, ydim: int = 10, lr_start: float = 0.05, lr_end: float = 0.01, seed=42, normalize=True)[source]¶

Bases: PixieSOMCluster

__init__(cell_data: pandas.DataFrame, weights_path: Path, fovs: List[str], columns: List[str], num_passes: int = 1, xdim: int = 10, ydim: int = 10, lr_start: float = 0.05, lr_end: float = 0.01, seed=42, normalize=True)[source]¶

Creates a cell SOM cluster object derived from the abstract PixieSOMCluster

Parameters:

cell_data (pandas.DataFrame) – The dataset to use for training
weights_path (pathlib.Path) – The path to save the weights to.
fovs (List[str]) – The list of FOVs to subset the data on.
columns (List[str]) – The list of columns to subset the data on.
num_passes (int) – The number of SOM training passes to use.
xdim (int) – The number of SOM nodes on the x-axis.
ydim (int) – The number of SOM nodes on the y-axis.
lr_start (float) – The initial learning rate.
lr_end (float) – The learning rate to decay to.
seed (int) – The random seed to use.
normalize (bool) – Whether to perform 99.9th percentile normalization. Defaults to True.

assign_som_clusters(num_parallel_cells=1000000) → pandas.DataFrame[source]¶

Assigns SOM clusters using weights to cell_data

Parameters:

num_parallel_cells (int) – Partition size of self.cell_data for assigning SOM labels

Returns:

cell_data with the SOM clusters assigned.

Return type:

pandas.DataFrame

normalize_data()[source]¶

Normalizes cell_data by the 99.9th percentile value of each pixel cluster count column

Returns:

cell_data with columns normalized by the values in norm_data

Return type:

pandas.DataFrame
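A minimal sketch of percentile normalization, assuming each column is simply divided by its own 99.9th percentile. The helper name and the zero-guard are assumptions of this example and are not taken from ark.

```python
import numpy as np
import pandas as pd


def normalize_by_percentile(cell_data, columns, q=0.999):
    """Divide each listed column by its q-th quantile (0.999 = 99.9th percentile)."""
    norm_vals = cell_data[columns].quantile(q)
    # Assumption of this sketch: leave all-zero columns untouched instead of dividing by 0
    norm_vals = norm_vals.replace(0, 1)
    out = cell_data.copy()
    out[columns] = cell_data[columns] / norm_vals
    return out


cell_data = pd.DataFrame({
    "cluster_a": np.arange(1001, dtype=float),
    "cluster_b": np.zeros(1001),
})
normalized = normalize_by_percentile(cell_data, ["cluster_a", "cluster_b"])
```

After this step the 99.9th percentile of each non-zero column equals 1, so a handful of extreme values cannot dominate SOM training.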

train_som(overwrite=False)[source]¶

Trains the SOM using cell_data

Parameters:

overwrite (bool) – If set, force retrains the SOM and overwrites the weights

class ark.phenotyping.cluster_helpers.ClusterClassTemplate(*args, **kwargs)[source]¶

Bases: Protocol

fit_predict() → None[source]¶

property n_clusters: int¶
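To illustrate what a structural protocol of this shape means in practice, here is a hedged sketch using `typing.Protocol`. `ClusterLike` and `EqualWidthBinner` are hypothetical names introduced for this example; unlike the signature documented above, `fit_predict` here accepts the data and returns labels, following the note in ConsensusCluster that the method is invoked on data.

```python
from typing import List, Protocol, Sequence, runtime_checkable


@runtime_checkable
class ClusterLike(Protocol):
    """Hypothetical stand-in for the documented protocol, not ark's class."""

    @property
    def n_clusters(self) -> int: ...

    def fit_predict(self, data: Sequence[float]) -> List[int]: ...


class EqualWidthBinner:
    """Toy clusterer: splits 1-D values into n_clusters equal-width bins."""

    def __init__(self, n_clusters: int):
        self.n_clusters = n_clusters

    def fit_predict(self, data: Sequence[float]) -> List[int]:
        lo, hi = min(data), max(data)
        width = (hi - lo) / self.n_clusters or 1.0
        # Clamp so that the maximum value lands in the last bin
        return [min(int((x - lo) / width), self.n_clusters - 1) for x in data]


binner = EqualWidthBinner(n_clusters=2)
labels = binner.fit_predict([0.0, 0.1, 0.9, 1.0])
# No inheritance is needed: having the right attributes is enough
conforms = isinstance(binner, ClusterLike)
```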

class ark.phenotyping.cluster_helpers.ConsensusCluster(cluster: ClusterClassTemplate, L: int, K: int, H: int, resample_proportion: float = 0.5)[source]¶

Bases: object

__init__(cluster: ClusterClassTemplate, L: int, K: int, H: int, resample_proportion: float = 0.5)[source]¶

Implementation of Consensus clustering, following the paper https://link.springer.com/content/pdf/10.1023%2FA%3A1023949509487.pdf

Parameters:

cluster (Callable) – Clustering class. NOTE: the class is to be instantiated with parameter n_clusters, and possess a fit_predict method, which is invoked on data.
L (int) – Smallest number of clusters to try.
K (int) – Biggest number of clusters to try.
H (int) – Number of resamplings for each cluster number.
resample_proportion (float) – Proportion of the data to sample in each resampling.
Mk (numpy.ndarray) – Consensus matrices for each k (shape =(K,data.shape[0],data.shape[0])). NOTE: every consensus matrix is retained, as specified in the paper.
Ak (numpy.ndarray) – Area under CDF for each number of clusters. See paper: section 3.3.1. Consensus distribution.
deltaK (numpy.ndarray) – Changes in areas under CDF. See paper: section 3.3.1. Consensus distribution.
self.bestK (int) – Number of clusters that was found to be best.
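The consensus matrix for a single cluster number k can be sketched as follows: for each of H resamplings, cluster a subsample, then divide the number of times two points were clustered together by the number of times they were sampled together. This is a simplified illustration of the idea in the cited paper, not the class documented here; `TinyKMeans` and `consensus_matrix` are hypothetical helpers.

```python
import numpy as np


class TinyKMeans:
    """Toy k-means with farthest-point initialisation (hypothetical helper)."""

    def __init__(self, n_clusters, n_iter=10):
        self.n_clusters = n_clusters
        self.n_iter = n_iter

    def fit_predict(self, data):
        centers = [data[0]]
        for _ in range(1, self.n_clusters):
            # Distance from every point to its nearest already-chosen center
            nearest = np.min([np.linalg.norm(data - c, axis=1) for c in centers], axis=0)
            centers.append(data[np.argmax(nearest)])
        centers = np.array(centers, dtype=float)
        for _ in range(self.n_iter):
            dists = np.linalg.norm(data[:, None, :] - centers[None, :, :], axis=2)
            labels = dists.argmin(axis=1)
            for j in range(self.n_clusters):
                if np.any(labels == j):
                    centers[j] = data[labels == j].mean(axis=0)
        return labels


def consensus_matrix(data, cluster_cls, k, H, resample_proportion, rng):
    """Fraction of co-sampled resamplings in which each pair shares a cluster."""
    n = data.shape[0]
    together = np.zeros((n, n))
    sampled = np.zeros((n, n))
    size = int(resample_proportion * n)
    for _ in range(H):
        idx = rng.choice(n, size=size, replace=False)
        labels = np.asarray(cluster_cls(n_clusters=k).fit_predict(data[idx]))
        same = (labels[:, None] == labels[None, :]).astype(float)
        sampled[np.ix_(idx, idx)] += 1
        together[np.ix_(idx, idx)] += same
    # Pairs that were never sampled together get a consensus of 0
    return np.divide(together, sampled, out=np.zeros_like(together), where=sampled > 0)


rng = np.random.default_rng(0)
# Two well-separated blobs of ten points each
data = np.vstack([rng.normal(0.0, 0.1, (10, 2)), rng.normal(5.0, 0.1, (10, 2))])
M = consensus_matrix(data, TinyKMeans, k=2, H=50, resample_proportion=0.8, rng=rng)
```

For cleanly separated data the entries of M sit near 0 or 1; the area under the CDF of those entries is what Ak summarizes for each k.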

fit(data: pandas.DataFrame, verbose: bool = False)[source]¶

Fits a consensus matrix for each number of clusters

Parameters:

data (pd.DataFrame) – The data in (examples,attributes) format.
verbose (bool) – Whether to print progress output.

predict()[source]¶

Predicts on the consensus matrix, for best found cluster number.

Returns:

The consensus matrix prediction for self.bestK.

Return type:

numpy.ndarray

predict_data(data: pandas.DataFrame)[source]¶

Predicts on the data, for best found cluster number

Parameters:

data (pandas.DataFrame) – (examples,attributes) format

Returns:

The data matrix prediction for self.bestK.

Return type:

pandas.DataFrame

class ark.phenotyping.cluster_helpers.PixelSOMCluster(pixel_subset_folder: Path, norm_vals_path: Path, weights_path: Path, fovs: List[str], columns: List[str], num_passes: int = 1, xdim: int = 10, ydim: int = 10, lr_start: float = 0.05, lr_end: float = 0.01, seed=42)[source]¶

Bases: PixieSOMCluster

__init__(pixel_subset_folder: Path, norm_vals_path: Path, weights_path: Path, fovs: List[str], columns: List[str], num_passes: int = 1, xdim: int = 10, ydim: int = 10, lr_start: float = 0.05, lr_end: float = 0.01, seed=42)[source]¶

Creates a pixel SOM cluster object derived from the abstract PixieSOMCluster

Parameters:

pixel_subset_folder (pathlib.Path) – The name of the subsetted pixel data directory
norm_vals_path (pathlib.Path) – The name of the feather file containing the normalization values.
weights_path (pathlib.Path) – The path to save the weights to.
fovs (List[str]) – The list of FOVs to subset the data on.
columns (List[str]) – The list of columns to subset the data on.
num_passes (int) – The number of SOM training passes to use.
xdim (int) – The number of SOM nodes on the x-axis.
ydim (int) – The number of SOM nodes on the y-axis.
lr_start (float) – The initial learning rate.
lr_end (float) – The learning rate to decay to.
seed (int) – The random seed to use.

assign_som_clusters(external_data: pandas.DataFrame, normalize_data: bool = True, num_parallel_pixels: int = 1000000) → pandas.DataFrame[source]¶

Assigns SOM clusters using weights to a dataset

Parameters:

external_data (pandas.DataFrame) – The dataset to assign SOM clusters to
normalize_data (bool) – Whether or not to normalize external_data. Flag needed to prevent re-normalization of normalized dataset.
num_parallel_pixels (int) – Partition size of external_data for assigning SOM labels

Returns:

The dataset with the SOM clusters assigned.

Return type:

pandas.DataFrame

normalize_data(external_data: pandas.DataFrame) → pandas.DataFrame[source]¶

Uses norm_data to normalize a dataset

Parameters:

external_data (pandas.DataFrame) – The data to normalize

Returns:

The data with columns normalized by the values in norm_data

Return type:

pandas.DataFrame

train_som(overwrite=False)[source]¶

Trains the SOM using train_data

Parameters:

overwrite (bool) – If set, force retrains the SOM and overwrites the weights

class ark.phenotyping.cluster_helpers.PixieConsensusCluster(cluster_type: str, input_file: Path, columns: List[str], max_k: int = 20, cap: float = 3)[source]¶

Bases: object

__init__(cluster_type: str, input_file: Path, columns: List[str], max_k: int = 20, cap: float = 3)[source]¶

Constructs a generic ConsensusCluster pipeline object that makes use of Sagovic’s implementation of consensus clustering in Python.

Parameters:

cluster_type (str) – The type of data being run through consensus clustering. Must be either 'pixel' or 'cell'
input_file (pathlib.Path) – The .csv file of average expression values per SOM cluster, computed by ark.phenotyping.cluster_pixels or ark.phenotyping.cluster_cells depending on the type of data being generated.
columns (List[str]) – The list of columns to subset the data in input_file on for consensus clustering.
max_k (int) – The number of consensus clusters to use.
cap (float) – The value to cap the data in input_file at after z-score normalization. Data will be within the range [-cap, cap].

assign_consensus_labels(external_data: pandas.DataFrame) → pandas.DataFrame[source]¶

Takes an external dataset and applies ConsensusCluster mapping to it.

Parameters:

external_data (pandas.DataFrame) – A dataset which contains a '{self.cluster_type}_som_cluster' column.

Returns:

The external_data with a '{self.cluster_type}_meta_cluster' column attached.

Return type:

pandas.DataFrame

generate_som_to_meta_map()[source]¶

Maps each '{self.cluster_type}_som_cluster' to the meta cluster generated by ConsensusCluster.

Also assigns mapping to self.mapping for use in assign_consensus_labels.

run_consensus_clustering()[source]¶

Fits the meta clustering results using ConsensusCluster.

save_som_to_meta_map(save_path: Path)[source]¶

Saves the mapping generated by ConsensusCluster to save_path.

Parameters:

save_path (pathlib.Path) – The path to save self.mapping to.

scale_data()[source]¶

z-scores and caps input_data.

Scaling will be done on a per-column basis for all column names specified. Capping will truncate the data in the range [-cap, cap].
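The scale-then-cap behaviour described above can be sketched in a few lines of pandas. The helper name is hypothetical, and the use of the population standard deviation (`ddof=0`) is an assumption of this example rather than a documented detail.

```python
import numpy as np
import pandas as pd


def zscore_and_cap(df, columns, cap=3.0):
    """Z-score each listed column, then truncate values to [-cap, cap]."""
    scaled = df.copy()
    vals = df[columns].astype(float)
    # Assumption: population standard deviation (ddof=0)
    z = (vals - vals.mean()) / vals.std(ddof=0)
    scaled[columns] = z.clip(lower=-cap, upper=cap)
    return scaled


# Nineteen zeros and one large outlier
avg_expr = pd.DataFrame({"CD45": [0.0] * 19 + [100.0]})
capped = zscore_and_cap(avg_expr, ["CD45"], cap=3.0)
```

Here the outlier's raw z-score is above 4, so it is truncated to the cap, while the remaining values are left at their z-scores.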

class ark.phenotyping.cluster_helpers.PixieSOMCluster(weights_path: Path, columns: List[str], num_passes: int = 1, xdim: int = 10, ydim: int = 10, lr_start: float = 0.05, lr_end: float = 0.01, seed=42)[source]¶

Bases: ABC

abstract __init__(weights_path: Path, columns: List[str], num_passes: int = 1, xdim: int = 10, ydim: int = 10, lr_start: float = 0.05, lr_end: float = 0.01, seed=42)[source]¶

Generic implementation of a pyFlowSOM runner

Parameters:

weights_path (pathlib.Path) – The path to save the weights to.
columns (List[str]) – The list of columns to subset the data on.
num_passes (int) – The number of SOM training passes to use.
xdim (int) – The number of SOM nodes on the x-axis.
ydim (int) – The number of SOM nodes on the y-axis.
lr_start (float) – The initial learning rate.
lr_end (float) – The learning rate to decay to.
seed (int) – The random seed to use for training.

generate_som_clusters(external_data: pandas.DataFrame, num_parallel_obs: int = 1000000) → numpy.ndarray[source]¶

Uses the weights to generate SOM clusters for a dataset

Parameters:

external_data (pandas.DataFrame) – The dataset to generate SOM clusters for
num_parallel_obs (int) – Partition size of external_data for assigning SOM labels

Returns:

The SOM clusters generated for each pixel in external_data

Return type:

numpy.ndarray
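Conceptually, label assignment with trained SOM weights is a nearest-node lookup performed one partition at a time to bound memory. The sketch below uses plain NumPy and is not the pyFlowSOM code path; the helper name and the 1-based labels are assumptions.

```python
import numpy as np


def assign_som_nodes(data, weights, num_parallel_obs=1_000_000):
    """Label each row of data with the index of its closest weight vector."""
    labels = np.empty(data.shape[0], dtype=int)
    for start in range(0, data.shape[0], num_parallel_obs):
        chunk = data[start:start + num_parallel_obs]
        # Squared Euclidean distance from every row in the chunk to every node
        d2 = ((chunk[:, None, :] - weights[None, :, :]) ** 2).sum(axis=2)
        # Assumption of this sketch: cluster labels start at 1
        labels[start:start + num_parallel_obs] = d2.argmin(axis=1) + 1
    return labels


weights = np.array([[0.0, 0.0], [1.0, 1.0]])
data = np.array([[0.1, 0.0], [0.9, 1.0], [0.2, 0.1]])
labels_all = assign_som_nodes(data, weights)
labels_chunked = assign_som_nodes(data, weights, num_parallel_obs=2)
```

The partition size changes only peak memory use, not the resulting labels.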

abstract normalize_data() → pandas.DataFrame[source]¶

Generic implementation of the normalization process to use on the input data

Returns:

The data with columns normalized by the values in norm_data

Return type:

pandas.DataFrame

train_som(data: pandas.DataFrame)[source]¶

Trains the SOM on the data provided and saves the weights generated

Parameters:

data (pandas.DataFrame) – The input data to train the SOM on.

ark.phenotyping.cluster_helpers.verify_unique_meta_clusters(pixie_remapped_data: pandas.DataFrame, meta_cluster_type: Literal['pixel', 'cell'])[source]¶

Verifies that a mapping contains a unique renamed meta cluster for every base meta cluster

Parameters:

pixie_remapped_data (pandas.DataFrame) – Must have {pixel/cell}_meta_cluster and {pixel/cell}_meta_cluster_rename columns
meta_cluster_type (Literal['pixel', 'cell']) – Whether pixel or cell meta clusters are being validated

Raises:

ValueError – If there are duplicate {pixel/cell}_meta_cluster_rename entries for multiple {pixel/cell}_meta_cluster values
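One way to express this uniqueness check, assuming the column naming given above, is shown below. `check_unique_renames` is a hypothetical helper written for illustration; the real function's internals and error message may differ.

```python
import pandas as pd


def check_unique_renames(remapped, meta_cluster_type):
    """Raise ValueError if one renamed label is shared by several meta clusters."""
    meta_col = f"{meta_cluster_type}_meta_cluster"
    rename_col = f"{meta_cluster_type}_meta_cluster_rename"
    # One row per distinct (meta cluster, renamed meta cluster) pair
    pairs = remapped[[meta_col, rename_col]].drop_duplicates()
    shared = pairs[pairs[rename_col].duplicated(keep=False)]
    if not shared.empty:
        names = sorted(shared[rename_col].unique())
        raise ValueError(f"Renamed meta clusters used more than once: {names}")


ok = pd.DataFrame({
    "cell_meta_cluster": [1, 1, 2],
    "cell_meta_cluster_rename": ["T cell", "T cell", "B cell"],
})
bad = pd.DataFrame({
    "cell_meta_cluster": [1, 2],
    "cell_meta_cluster_rename": ["T cell", "T cell"],
})

check_unique_renames(ok, "cell")  # passes silently
try:
    check_unique_renames(bad, "cell")
    raised = False
except ValueError:
    raised = True
```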

ark.phenotyping.pixel_cluster_utils¶

ark.phenotyping.pixel_cluster_utils.calculate_channel_percentiles(tiff_dir, fovs, channels, img_sub_folder, percentile)[source]¶

Calculates average percentile for each channel in the dataset

Parameters:

tiff_dir (str) – Name of the directory containing the tiff files
fovs (list) – List of fovs to include
channels (list) – List of channels to include
img_sub_folder (str) – Sub folder within each FOV containing image data
percentile (float) – The specific percentile to compute

Returns:

The mapping between each channel and its normalization value

Return type:

pd.DataFrame

ark.phenotyping.pixel_cluster_utils.calculate_pixel_intensity_percentile(tiff_dir, fovs, channels, img_sub_folder, channel_percentiles, percentile=0.05)[source]¶

Calculates average percentile per FOV for total signal in each pixel

Parameters:

tiff_dir (str) – Name of the directory containing the tiff files
fovs (list) – List of fovs to include
channels (list) – List of channels to include
img_sub_folder (str) – Sub folder within each FOV containing image data
channel_percentiles (pd.DataFrame) – The mapping between each channel and its normalization value. Computed by calculate_channel_percentiles
percentile (float) – The pixel intensity percentile per FOV to average over

Returns:

The average percentile per FOV for total signal in each pixel

Return type:

float

ark.phenotyping.pixel_cluster_utils.check_for_modified_channels(tiff_dir, test_fov, img_sub_folder, channels)[source]¶

Checks to make sure the user selected newly modified channels

Parameters:

tiff_dir (str) – Name of the directory containing the tiff files
test_fov (str) – example fov used to check channel names
img_sub_folder (str) – sub-folder within each FOV containing image data
channels (list) – list of channels to use for analysis

ark.phenotyping.pixel_cluster_utils.compute_pixel_cluster_channel_avg(fovs, channels, base_dir, pixel_cluster_col, num_pixel_clusters, pixel_data_dir='pixel_mat_data', num_fovs_subset=100, seed=42, keep_count=False)[source]¶

Compute the average channel values across each pixel SOM cluster.

To improve performance, the number of FOVs is downsampled to num_fovs_subset

Parameters:

fovs (list) – The list of fovs to subset on
channels (list) – The list of channels to subset on
base_dir (str) – The path to the data directories
pixel_cluster_col (str) – Name of the column to group by
num_pixel_clusters (int) – The desired number of pixel clusters. If None, no fixed number is required
pixel_data_dir (str) – Name of the directory containing the pixel data with cluster labels
num_fovs_subset (float) – The number of FOVs to subset on. Note that if len(fovs) < num_fovs_subset, all of the FOVs will still be selected
seed (int) – The random seed to use for subsetting FOVs
keep_count (bool) – Whether to keep the count column when aggregating or not. This should only be set to True for visualization purposes

Returns:

Contains the average channel values for each pixel SOM/meta cluster

Return type:

pandas.DataFrame

ark.phenotyping.pixel_cluster_utils.filter_with_nuclear_mask(fovs: List, tiff_dir: str, seg_dir: str, channel: str, nuc_seg_suffix: str = '_nuclear.tiff', img_sub_folder: str = None, exclude: bool = True)[source]¶

Filters out background staining using subcellular marker localization.

Non-nuclear signal is removed from nuclear markers and vice-versa for membrane markers.

Parameters:

fovs (list) – The list of fovs to filter
tiff_dir (str) – Name of the directory containing the tiff files
seg_dir (str) – Name of the directory containing the segmented files
channel (str) – Channel to apply filtering to
nuc_seg_suffix (str) – The suffix for the nuclear channel. (e.g., for “fov1”, a suffix of “_nuclear.tiff” would make a file named “fov1_nuclear.tiff”)
img_sub_folder (str) – Name of the subdirectory inside tiff_dir containing the tiff files. Set to None if there isn’t any.
exclude (bool) – Whether to filter out nuclear or membrane signal
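The masking idea can be sketched with NumPy on in-memory arrays. How `exclude` maps onto the two cases is an assumption of this example (True removes signal inside nuclei, False removes signal outside them), and the helper name is hypothetical; the documented function works on the tiff and segmentation files named by its arguments.

```python
import numpy as np


def apply_nuclear_mask(channel_img, nuclear_labels, exclude=True):
    """Zero out channel signal inside (exclude=True) or outside nuclei."""
    nuclear = nuclear_labels > 0  # any non-zero label counts as nucleus
    keep = ~nuclear if exclude else nuclear
    return channel_img * keep


channel_img = np.array([[5, 5], [5, 5]])
nuclear_labels = np.array([[1, 0], [0, 2]])

# Assumed usage: a membrane marker keeps only non-nuclear pixels
membrane_like = apply_nuclear_mask(channel_img, nuclear_labels, exclude=True)
# Assumed usage: a nuclear marker keeps only nuclear pixels
nuclear_like = apply_nuclear_mask(channel_img, nuclear_labels, exclude=False)
```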

ark.phenotyping.pixel_cluster_utils.find_fovs_missing_col(base_dir, data_dir, missing_col)[source]¶

Identify FOV names in data_dir without missing_col

Parameters:

base_dir (str) – The path to the data directories
data_dir (str) – Name of the directory which contains the full preprocessed pixel data
missing_col (str) – Name of the column to identify

Returns:

List of FOVs without missing_col

Return type:

list

ark.phenotyping.pixel_cluster_utils.normalize_rows(pixel_data, channels, include_seg_label=True)[source]¶

Normalizes the rows of a pixel matrix by their sum

Parameters:

pixel_data (pandas.DataFrame) – The dataframe containing the pixel data for a given fov. Includes channel and meta (fov, label, etc.) columns
channels (list) – List of channels to subset over
include_seg_label (bool) – Whether to include 'label' as a metadata column

Returns:

The pixel data with rows normalized and 0-sum rows removed

Return type:

pandas.DataFrame
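Row normalization as described here amounts to dividing each pixel's channel values by their sum and dropping pixels whose channels sum to zero. A minimal pandas sketch follows; the helper name and toy channel names are hypothetical, and the `include_seg_label` handling is omitted.

```python
import pandas as pd


def normalize_rows_sketch(pixel_data, channels):
    """Divide channel values by their row sum; drop rows whose sum is zero."""
    row_sums = pixel_data[channels].sum(axis=1)
    nonzero = row_sums > 0
    kept = pixel_data.loc[nonzero].copy()
    kept[channels] = kept[channels].div(row_sums[nonzero], axis=0)
    return kept


pixel_data = pd.DataFrame({
    "fov": ["fov0", "fov0", "fov0"],
    "CD3": [1, 0, 2],
    "CD8": [3, 0, 2],
})
row_normalized = normalize_rows_sketch(pixel_data, ["CD3", "CD8"])
```

Metadata columns such as fov pass through unchanged, and every surviving row's channel values sum to 1.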

ark.phenotyping.pixel_cluster_utils.smooth_channels(fovs, tiff_dir, img_sub_folder, channels, smooth_vals)[source]¶

Adds additional smoothing for selected channels as a preprocessing step

Parameters:

fovs (list) – List of fovs to process
tiff_dir (str) – Name of the directory containing the tiff files
img_sub_folder (str) – sub-folder within each FOV containing image data
channels (list) – list of channels to apply smoothing to
smooth_vals (list or int) – amount to smooth channels. If a single int, applies to all channels. Otherwise, a custom value per channel can be supplied

ark.phenotyping.pixel_meta_clustering¶

ark.phenotyping.pixel_meta_clustering.apply_pixel_meta_cluster_remapping(fovs, channels, base_dir, pixel_data_dir, pixel_remapped_name, multiprocess=False, batch_size=5)[source]¶

Apply the meta cluster remapping to the data in pixel_data_dir.

Parameters:

fovs (list) – The list of fovs to subset on
channels (list) – The list of channels to subset on
base_dir (str) – The path to the data directories
pixel_data_dir (str) – Name of directory with the full pixel data. This data should also have the SOM cluster labels appended from cluster_pixels and the meta cluster labels appended from pixel_consensus_cluster.
pixel_remapped_name (str) – Name of the file containing the pixel SOM clusters to their remapped meta clusters
multiprocess (bool) – Whether to use multiprocessing or not
batch_size (int) – The number of FOVs to process in parallel

ark.phenotyping.pixel_meta_clustering.generate_meta_avg_files(fovs, channels, base_dir, pixel_cc, data_dir='pixel_mat_data', pc_chan_avg_som_cluster_name='pixel_channel_avg_som_cluster.csv', pc_chan_avg_meta_cluster_name='pixel_channel_avg_meta_cluster.csv', num_fovs_subset=100, seed=42, overwrite=False)[source]¶

Computes and saves the average channel expression across pixel meta clusters. Assigns meta cluster labels to the data stored in pc_chan_avg_som_cluster_name.

Parameters:

fovs (list) – The list of fovs to subset on
channels (list) – The list of channels to subset on
base_dir (str) – The path to the data directory
pixel_cc (cluster_helpers.PixieConsensusCluster) – The consensus cluster object containing the SOM to meta mapping
data_dir (str) – Name of the directory which contains the full preprocessed pixel data. This data should also have the SOM cluster labels appended from cluster_pixels.
pc_chan_avg_som_cluster_name (str) – Name of file to save the channel-averaged results across all SOM clusters to
pc_chan_avg_meta_cluster_name (str) – Name of file to save the channel-averaged results across all meta clusters to
num_fovs_subset (float) – The number of FOVs to subset on for meta cluster channel averaging
seed (int) – The random seed to use for subsetting FOVs
overwrite (bool) – If set, force overwrites the existing average channel expression file if it exists

ark.phenotyping.pixel_meta_clustering.generate_remap_avg_files(fovs, channels, base_dir, pixel_data_dir, pixel_remapped_name, pc_chan_avg_som_cluster_name, pc_chan_avg_meta_cluster_name, num_fovs_subset=100, seed=42)[source]¶

Resaves the re-mapped consensus data to pixel_data_dir and re-runs the average channel expression per pixel meta cluster computation.

Re-maps the pixel SOM clusters to meta clusters in pc_chan_avg_som_cluster_name.

Parameters:

fovs (list) – The list of fovs to subset on
channels (list) – The list of channels to subset on
base_dir (str) – The path to the data directories
pixel_data_dir (str) – Name of directory with the full pixel data. This data should also have the SOM cluster labels appended from cluster_pixels and the meta cluster labels appended from pixel_consensus_cluster.
pixel_remapped_name (str) – Name of the file containing the pixel SOM clusters to their remapped meta clusters
pc_chan_avg_som_cluster_name (str) – Name of the file containing the channel-averaged results across all SOM clusters
pc_chan_avg_meta_cluster_name (str) – Name of the file containing the channel-averaged results across all meta clusters
num_fovs_subset (float) – The number of FOVs to subset on for meta cluster channel averaging
seed (int) – The random seed to use for subsetting FOVs

ark.phenotyping.pixel_meta_clustering.pixel_consensus_cluster(fovs, channels, base_dir, max_k=20, cap=3, data_dir='pixel_mat_data', pc_chan_avg_som_cluster_name='pixel_channel_avg_som_cluster.csv', multiprocess=False, batch_size=5, seed=42, overwrite=False)[source]¶

Run consensus clustering algorithm on pixel-level summed data across channels. Saves data with consensus cluster labels to data_dir.

Parameters:

fovs (list) – The list of fovs to subset on
channels (list) – The list of channels to subset on
base_dir (str) – The path to the data directory
max_k (int) – The number of consensus clusters
cap (int) – The z-score cap to use for hierarchical clustering
data_dir (str) – Name of the directory which contains the full preprocessed pixel data. This data should also have the SOM cluster labels appended from cluster_pixels.
pc_chan_avg_som_cluster_name (str) – Name of file to save the channel-averaged results across all SOM clusters to
multiprocess (bool) – Whether to use multiprocessing or not
batch_size (int) – The number of FOVs to process in parallel, ignored if multiprocess is False
seed (int) – The random seed to set for consensus clustering
overwrite (bool) – If set, force overwrites the meta labels in all the FOVs

Returns:

The consensus cluster object containing the SOM to meta mapping

Return type:

cluster_helpers.PixieConsensusCluster

ark.phenotyping.pixel_meta_clustering.run_pixel_consensus_assignment(pixel_data_path, pixel_cc_obj, fov)[source]¶

Helper function to assign pixel consensus clusters

Parameters:

pixel_data_path (str) – The path to the pixel data directory
pixel_cc_obj (ark.phenotyping.cluster_helpers.PixieConsensusCluster) – The pixel consensus cluster object
fov (str) – The name of the FOV to process

Returns:

The name of the FOV as well as the return code

Return type:

tuple (str, int)

ark.phenotyping.pixel_meta_clustering.update_pixel_meta_labels(pixel_data_path, pixel_remapped_dict, pixel_renamed_meta_dict, fov)[source]¶

Helper function to reassign meta cluster names based on remapping scheme to a FOV

Parameters:

pixel_data_path (str) – The path to the pixel data directory
pixel_remapped_dict (dict) – The mapping from pixel SOM cluster to pixel meta cluster label (not renamed)
pixel_renamed_meta_dict (dict) – The mapping from pixel meta cluster label to renamed pixel meta cluster name
fov (str) – The name of the FOV to process

Returns:

The name of the FOV as well as the return code

Return type:

tuple (str, int)
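The two dictionaries describe a two-stage lookup: SOM cluster to meta cluster, then meta cluster to its renamed label. The sketch below applies them to an in-memory table; `relabel_fov` is a hypothetical helper, and the file reading and writing and the return code of the documented function are deliberately left out.

```python
import pandas as pd


def relabel_fov(fov_data, pixel_remapped_dict, pixel_renamed_meta_dict):
    """Attach remapped meta cluster labels and their renamed names."""
    out = fov_data.copy()
    # Stage 1: pixel SOM cluster -> pixel meta cluster (not renamed)
    out["pixel_meta_cluster"] = out["pixel_som_cluster"].map(pixel_remapped_dict)
    # Stage 2: pixel meta cluster -> renamed pixel meta cluster
    out["pixel_meta_cluster_rename"] = out["pixel_meta_cluster"].map(pixel_renamed_meta_dict)
    return out


fov_data = pd.DataFrame({"pixel_som_cluster": [1, 2, 3, 2]})
relabeled = relabel_fov(
    fov_data,
    pixel_remapped_dict={1: 10, 2: 10, 3: 11},
    pixel_renamed_meta_dict={10: "tumor", 11: "immune"},
)
```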

ark.phenotyping.pixel_som_clustering¶

ark.phenotyping.pixel_som_clustering.cluster_pixels(fovs, base_dir, pixel_pysom, data_dir='pixel_mat_data', multiprocess=False, batch_size=5, num_parallel_pixels=1000000, overwrite=False)[source]¶

Uses trained SOM weights to assign cluster labels on full pixel data.

Saves data with cluster labels to data_dir.

Parameters:

fovs (list) – The list of fovs to subset on
base_dir (str) – The path to the data directory
pixel_pysom (cluster_helpers.PixelSOMCluster) – The SOM cluster object containing the pixel SOM weights
data_dir (str) – Name of the directory which contains the full preprocessed pixel data
multiprocess (bool) – Whether to use multiprocessing or not
batch_size (int) – The number of FOVs to process in parallel, ignored if multiprocess is False
num_parallel_pixels (int) – How many pixels to label in parallel at once for each FOV
overwrite (bool) – If set, force overwrite the SOM labels in all the FOVs
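
As a rough illustration of what assigning SOM labels from trained weights involves, the sketch below labels each pixel with its nearest weight vector, working in batches so memory stays bounded (the role num_parallel_pixels plays above). It assumes Euclidean distance and 1-based cluster labels; assign_som_labels is a hypothetical helper and does not reproduce the PixelSOMCluster API.

```python
import numpy as np


def assign_som_labels(pixels, weights, batch_size=1_000_000):
    """Label each pixel with the 1-based index of its nearest SOM node."""
    labels = np.empty(pixels.shape[0], dtype=np.int64)
    for start in range(0, pixels.shape[0], batch_size):
        batch = pixels[start:start + batch_size]
        # Squared Euclidean distance from every pixel in the batch to every node
        dists = ((batch[:, None, :] - weights[None, :, :]) ** 2).sum(axis=2)
        labels[start:start + batch_size] = dists.argmin(axis=1) + 1
    return labels


# Three made-up SOM nodes over two channels, and four pixels to label
weights = np.array([[0.0, 0.0], [1.0, 1.0], [0.0, 1.0]])
pixels = np.array([[0.1, 0.0], [0.9, 1.1], [0.1, 0.8], [0.6, 0.9]])
labels = assign_som_labels(pixels, weights, batch_size=2)
```

The batch size does not change the result, only how many pixels are compared against the weights at once.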

ark.phenotyping.pixel_som_clustering.generate_som_avg_files(fovs, channels, base_dir, pixel_pysom, data_dir='pixel_data_dir', pc_chan_avg_som_cluster_name='pixel_channel_avg_som_cluster.csv', num_fovs_subset=100, require_all_som_clusters=True, seed=42, overwrite=False)[source]¶

Computes and saves the average channel expression across pixel SOM clusters.

Parameters:

fovs (list) – The list of fovs to subset on
channels (list) – The list of channels to subset on
base_dir (str) – The path to the data directory
pixel_pysom (cluster_helpers.PixelSOMCluster) – The SOM cluster object containing the pixel SOM weights
data_dir (str) – Name of the directory which contains the full preprocessed pixel data
pc_chan_avg_som_cluster_name (str) – The name of the file to save the average channel expression across all SOM clusters
num_fovs_subset (int) – The number of FOVs to subset on for SOM cluster channel averaging
require_all_som_clusters (bool) – Whether to require all SOM clusters to have at least one pixel assigned
seed (int) – The random seed to set for subsetting FOVs
overwrite (bool) – If set, force overwrite the existing average channel expression file if it exists
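
The averaging step can be pictured as a group-by over the SOM label column. The sketch below is an assumption-laden illustration rather than ark's code: the table, the channel names, and the expected_clusters set are made up, and the check at the end mirrors what require_all_som_clusters is described as enforcing.

```python
import pandas as pd

# Made-up pixel table: one row per pixel, with its SOM label and channel values
pixel_data = pd.DataFrame({
    "pixel_som_cluster": [1, 1, 2, 3, 3],
    "CD3": [0.2, 0.4, 0.0, 0.9, 0.7],
    "CD20": [0.0, 0.2, 0.8, 0.1, 0.1],
})
channels = ["CD3", "CD20"]
expected_clusters = {1, 2, 3}  # hypothetical: every node of the trained SOM

# Average each channel within each SOM cluster
cluster_avg = pixel_data.groupby("pixel_som_cluster")[channels].mean().reset_index()

# Flag SOM clusters that received no pixels
missing = expected_clusters - set(cluster_avg["pixel_som_cluster"])
if missing:
    raise ValueError(f"SOM clusters with no pixels assigned: {sorted(missing)}")
```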

ark.phenotyping.pixel_som_clustering.run_pixel_som_assignment(pixel_data_path, pixel_pysom_obj, overwrite, num_parallel_pixels, fov)[source]¶

Helper function to assign pixel SOM cluster labels

Parameters:

pixel_data_path (str) – The path to the pixel data directory
pixel_pysom_obj (ark.phenotyping.cluster_helpers.PixelSOMCluster) – The pixel SOM cluster object
overwrite (bool) – Whether to overwrite the pixel SOM clusters or not
num_parallel_pixels (int) – How many pixels to label in parallel at once for each FOV
fov (str) – The name of the FOV to process

Returns:

The name of the FOV as well as the return code

Return type:

tuple (str, int)

ark.phenotyping.pixel_som_clustering.train_pixel_som(fovs, channels, base_dir, subset_dir='pixel_mat_subsetted', norm_vals_name='post_rowsum_chan_norm.feather', som_weights_name='pixel_som_weights.feather', xdim=10, ydim=10, lr_start=0.05, lr_end=0.01, num_passes=1, seed=42, overwrite=False)[source]¶

Run the SOM training on the subsetted pixel data.

Saves SOM weights to base_dir/som_weights_name.

Parameters:

fovs (list) – The list of fovs to subset on
channels (list) – The list of markers to subset on
base_dir (str) – The path to the data directories
subset_dir (str) – The name of the subsetted data directory
norm_vals_name (str) – The name of the file to store the 99.9% normalization values
som_weights_name (str) – The name of the file to save the SOM weights to
xdim (int) – The number of x nodes to use for the SOM
ydim (int) – The number of y nodes to use for the SOM
lr_start (float) – The start learning rate for the SOM, decays to lr_end
lr_end (float) – The end learning rate for the SOM, decays from lr_start
num_passes (int) – The number of training passes to make through the dataset
seed (int) – The random seed to use for training the SOM
overwrite (bool) – If set, force retrains the SOM and overwrites the weights

Returns:

The SOM cluster object containing the pixel SOM weights

Return type:

cluster_helpers.PixelSOMCluster

ark.phenotyping.pixie_preprocessing¶

ark.phenotyping.pixie_preprocessing.create_fov_pixel_data(fov, channels, img_data, seg_labels, pixel_thresh_val, blur_factor=2, subset_proportion=0.1)[source]¶

Preprocess pixel data for one fov

Parameters:

fov (str) – Name of the fov to index
channels (list) – List of channels to subset over
img_data (numpy.ndarray) – Array representing image data for one fov
seg_labels (numpy.ndarray) – Array representing segmentation labels for one fov
pixel_thresh_val (float) – value used to determine per-pixel cutoff for total signal inclusion
blur_factor (int) – The sigma to set for the Gaussian blur
subset_proportion (float) – The proportion of pixels to take from each fov

Returns:

Contains the following:

pandas.DataFrame: Gaussian blurred and channel sum normalized pixel data for a fov
pandas.DataFrame: subset of the preprocessed pixel dataset for a fov

Return type:

tuple
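
To make the two returned tables concrete, the sketch below covers the thresholding, channel-sum normalization, and subsetting steps on an already flattened (pixels x channels) array. The Gaussian blur is deliberately left out, the strict cutoff on total signal is an assumption, and normalize_and_subset is a hypothetical helper rather than the function documented here.

```python
import numpy as np


def normalize_and_subset(pixel_mat, pixel_thresh_val, subset_proportion=0.1, seed=42):
    """Drop low-signal pixels, scale each pixel by its channel sum, then subsample."""
    totals = pixel_mat.sum(axis=1)
    keep = totals > pixel_thresh_val  # assumed: strict cutoff on total signal
    kept = pixel_mat[keep]
    # Each remaining pixel's channel values now sum to 1
    normalized = kept / kept.sum(axis=1, keepdims=True)

    # Draw a reproducible random subset of the normalized pixels
    rng = np.random.default_rng(seed)
    n_subset = int(round(subset_proportion * normalized.shape[0]))
    idx = rng.choice(normalized.shape[0], size=n_subset, replace=False)
    return normalized, normalized[idx]


# Five made-up pixels over two channels; two of them fall below the threshold
pixel_mat = np.array([[1.0, 3.0], [0.0, 0.1], [2.0, 2.0], [5.0, 0.0], [0.0, 0.0]])
full, subset = normalize_and_subset(pixel_mat, pixel_thresh_val=0.5, subset_proportion=0.5)
```

With a positive threshold and non-negative intensities, every kept pixel has a non-zero channel sum, so the division is safe.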

ark.phenotyping.pixie_preprocessing.create_pixel_matrix(fovs, channels, base_dir, tiff_dir, seg_dir, img_sub_folder='TIFs', seg_suffix='_whole_cell.tiff', pixel_output_dir='pixel_output_dir', data_dir='pixel_mat_data', subset_dir='pixel_mat_subsetted', norm_vals_name_pre_rownorm='channel_norm_pre_rownorm.feather', norm_vals_name_post_rownorm='channel_norm_post_rownorm.feather', pixel_thresh_name='pixel_thresh.feather', channel_percentile_pre_rownorm=0.99, channel_percentile_post_rownorm=0.999, is_mibitiff=False, blur_factor=2, subset_proportion=0.1, seed=42, multiprocess=False, batch_size=5)[source]¶

For each fov, add a Gaussian blur to each channel and normalize channel sums for each pixel

Saves data to data_dir and subsetted data to subset_dir

Parameters:

fovs (list) – List of fovs to subset over
channels (list) – List of channels to subset over, applies only to pixel_mat_subset
base_dir (str) – The path to the data directories
tiff_dir (str) – Name of the directory containing the tiff files
seg_dir (str) – Name of the directory containing the segmented files. Set to None if no segmentation directory is available or desired.
img_sub_folder (str) – Name of the subdirectory inside tiff_dir containing the tiff files. Set to None if there isn’t any.
seg_suffix (str) – The suffix that the segmentation images use. Ignored if seg_dir is None.
pixel_output_dir (str) – The name of the data directory containing the pixel data to use for the clustering pipeline. data_dir and subset_dir should be placed here.
data_dir (str) – Name of the directory which contains the full preprocessed pixel data. Should be placed in pixel_output_dir.
subset_dir (str) – The name of the directory containing the subsetted pixel data. Should be placed in pixel_output_dir.
norm_vals_name_pre_rownorm (str) – The name of the file to store the pre-pixel-normalized norm values
norm_vals_name_post_rownorm (str) – The name of the file to store the post-pixel-normalized norm values
pixel_thresh_name (str) – The name of the file to store the pixel threshold value
channel_percentile_pre_rownorm (float) – Percentile used to normalize channels before pixel normalization
channel_percentile_post_rownorm (float) – Percentile used to normalize channels after pixel normalization
is_mibitiff (bool) – Whether to load the images from MIBITiff
blur_factor (int) – The sigma to set for the Gaussian blur
subset_proportion (float) – The proportion of pixels to take from each fov
seed (int) – The random seed to set for subsetting
multiprocess (bool) – Whether to use multiprocessing or not
batch_size (int) – The number of FOVs to process in parallel, ignored if multiprocess is False

ark.phenotyping.pixie_preprocessing.preprocess_fov(base_dir, tiff_dir, data_dir, subset_dir, seg_dir, seg_suffix, img_sub_folder, is_mibitiff, channels, blur_factor, subset_proportion, pixel_thresh_val, seed, channel_norm_df, fov)[source]¶

Helper function to read in the FOV-level pixel data, run create_fov_pixel_data, and save the preprocessed data.

Parameters:

base_dir (str) – The path to the data directories
tiff_dir (str) – Name of the directory containing the tiff files
data_dir (str) – Name of the directory which contains the full preprocessed pixel data
subset_dir (str) – The name of the directory containing the subsetted pixel data
seg_dir (str) – Name of the directory containing the segmented files. Set to None if no segmentation directory is available or desired.
seg_suffix (str) – The suffix that the segmentation images use. Ignored if seg_dir is None.
img_sub_folder (str) – Name of the subdirectory inside tiff_dir containing the tiff files. Set to None if there isn’t any.
is_mibitiff (bool) – Whether to load the images from MIBITiff
channels (list) – List of channels to subset over, applies only to pixel_mat_subset
blur_factor (int) – The sigma to set for the Gaussian blur
subset_proportion (float) – The proportion of pixels to take from each fov
pixel_thresh_val (float) – The value to normalize the pixels by
seed (int) – The random seed to set for subsetting
channel_norm_df (pandas.DataFrame) – The channel normalization values to use
fov (str) – The name of the FOV to preprocess

Returns:

The full preprocessed pixel dataset, needed for computing 99.9% normalized values in create_pixel_matrix

Return type:

pandas.DataFrame

ark.phenotyping.post_cluster_utils¶

ark.phenotyping.post_cluster_utils.create_mantis_project(cell_table: pandas.DataFrame, fovs: List[str], seg_dir: Union[Path, str], mask_dir: Union[Path, str], image_dir: Union[Path, str], mantis_dir: Union[Path, str], pop_col: str = 'cell_meta_cluster', fov_col: str = 'fov', label_col: str = 'label', seg_suffix_name: str = '_whole_cell.tiff') → None[source]¶

Creates a complete Mantis Project for viewing cell labels.

Parameters:

cell_table (pd.DataFrame) – DataFrame of extracted cell features and subtypes.
fovs (List[str]) – A list of FOVs to use for creating the project.
seg_dir (Union[pathlib.Path, str]) – The path to the directory containing the segmentation images.
mask_dir (Union[pathlib.Path, str]) – The path to the directory where the masks will be stored.
image_dir (Union[pathlib.Path, str]) – The path to the directory containing the raw image data.
mantis_dir (Union[pathlib.Path, str]) – The path to the directory where the mantis project will be created.
pop_col (str, optional) – The column name containing the distinct cell populations. Defaults to settings.CELL_TYPE ("cell_meta_cluster").
fov_col (str, optional) – The column name containing the FOV IDs. Defaults to settings.FOV_ID ("fov").
label_col (str, optional) – The column name containing the cell label. Defaults to settings.CELL_LABEL ("label").
seg_suffix_name (str, optional) – The suffix of the segmentation file and its file extension. Defaults to "_whole_cell.tiff".

ark.phenotyping.post_cluster_utils.generate_new_cluster_resolution(cell_table, cluster_col, new_cluster_col, cluster_mapping, save_path)[source]¶

Add a new column of broader cell cluster assignments to the cell table.

Parameters:

cell_table (pd.DataFrame) – cell table with clustered cell populations
cluster_col (str) – column containing the cell phenotype
new_cluster_col (str) – new column to create
cluster_mapping (dict) – dictionary with keys detailing the new cluster names and values explaining which cell types to group together
save_path (str) – where to save the new cell table
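
The shape of cluster_mapping is easiest to see with a small example. The sketch below does not call ark: the cell table, the population names, and the column name cell_cluster_broad are made up, and saving to save_path is omitted. It only shows how a mapping keyed by new cluster name can be inverted and applied.

```python
import pandas as pd

# Made-up cell table with one fine-grained phenotype per cell
cell_table = pd.DataFrame({
    "label": [1, 2, 3, 4],
    "cell_meta_cluster": ["CD4T", "CD8T", "B", "Macrophage"],
})

# Keys are the new, broader names; values list the existing cell types to group.
cluster_mapping = {
    "lymphocyte": ["CD4T", "CD8T", "B"],
    "myeloid": ["Macrophage"],
}

# Invert the mapping so each existing cell type points at its new cluster name.
old_to_new = {old: new for new, olds in cluster_mapping.items() for old in olds}

# Add the broader assignment as a new column (hypothetical column name)
cell_table["cell_cluster_broad"] = cell_table["cell_meta_cluster"].map(old_to_new)
```

Any cell type missing from the mapping would come out as NaN under this approach, which is worth checking before saving.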

ark.phenotyping.post_cluster_utils.plot_hist_thresholds(cell_table, populations, marker, pop_col='cell_meta_cluster', threshold=None, percentile=0.999)[source]¶

Create histograms to compare marker distributions across cell populations

Parameters:

cell_table (pd.DataFrame) – cell table with clustered cell populations
populations (list) – populations to plot as stacked histograms
marker (str) – the marker used to generate the histograms
pop_col (str) – the column containing the names of the cell populations
threshold (float, None) – optional value to plot a horizontal line for visualization
percentile (float) – cap used to control x axis limits of the plot

ark.phenotyping.weighted_channel_comp¶

ark.phenotyping.weighted_channel_comp.compute_cell_cluster_weighted_channel_avg(fovs, channels, base_dir, weighted_cell_channel_name, cell_cluster_data, cell_cluster_col='cell_meta_cluster')[source]¶

Computes the average weighted marker expression for each cell cluster

Parameters:

fovs (list) – The list of fovs to subset on
channels (list) – The list of channels to subset on
base_dir (str) – The path to the data directory
weighted_cell_channel_name (str) – The name of the weighted cell table, created in 3_Pixie_Cluster_Cells.ipynb
cell_cluster_data (pandas.DataFrame) – The cell data with cluster labels
cell_cluster_col (str) – Whether to aggregate by cell SOM or meta labels. Needs to be either ‘cell_som_cluster’ or ‘cell_meta_cluster’

Returns:

Each cell cluster mapped to the average expression for each marker

Return type:

pandas.DataFrame

ark.phenotyping.weighted_channel_comp.compute_p2c_weighted_channel_avg(pixel_channel_avg, channels, cell_counts, fovs=None, pixel_cluster_col='pixel_meta_cluster_rename')[source]¶

Compute the average marker expression for each cell weighted by pixel cluster

This expression is weighted by the pixel SOM/meta cluster counts. For each cell, the marker expression vector is computed as:

pixel_cluster_n_count * avg_marker_exp_pixel_cluster_n + ...

These values are then normalized by the cell’s respective size.

Note that this function will only be used to correct overlapping signal for visualization.

Parameters:

pixel_channel_avg (pandas.DataFrame) – The average channel values for each pixel SOM/meta cluster. Computed by compute_pixel_cluster_channel_avg
channels (list) – The list of channels to subset pixel_channel_avg by
cell_counts (pandas.DataFrame) – The dataframe listing the number of each type of pixel SOM/meta cluster per cell
fovs (list) – The list of fovs to include, if None provided all are used
pixel_cluster_col (str) – Name of the pixel cluster column to group by. Should be 'pixel_som_cluster' or 'pixel_meta_cluster_rename'

Returns:

Returns the average marker expression for each cell in the dataset

Return type:

pandas.DataFrame
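
The weighted sum in the description of compute_p2c_weighted_channel_avg amounts to a matrix product between the per-cell pixel cluster counts and the per-cluster channel averages. The sketch below uses made-up numbers and assumes, for illustration only, that a cell's size equals its total pixel count.

```python
import numpy as np

# Rows: cells; columns: pixel clusters. Entry = number of pixels of that cluster in the cell.
cell_counts = np.array([[10, 0, 30],
                        [5, 5, 0]], dtype=float)

# Rows: pixel clusters; columns: channels. Entry = average channel expression.
pixel_channel_avg = np.array([[0.2, 0.0],
                              [0.0, 0.4],
                              [0.1, 0.1]])

# Assumed here: cell size equals the total pixel count of the cell.
cell_size = cell_counts.sum(axis=1, keepdims=True)

# pixel_cluster_n_count * avg_marker_exp_pixel_cluster_n, summed over clusters n
weighted = cell_counts @ pixel_channel_avg

# Normalize each cell's vector by its size
weighted_avg = weighted / cell_size
```

Each row of weighted_avg is one cell's marker expression vector, with one column per channel.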

ark.phenotyping.weighted_channel_comp.generate_remap_avg_wc_files(fovs, channels, base_dir, cell_som_input_data, cell_remapped_name, weighted_cell_channel_name, cell_som_cluster_channel_avg_name, cell_meta_cluster_channel_avg_name)[source]¶

Apply the cell cluster remapping to the average weighted channel files

Parameters:

fovs (list) – The list of fovs to subset on
channels (list) – The list of channels to subset on
base_dir (str) – The path to the data directory
cell_som_input_data (pandas.DataFrame) – The input data used for SOM training. For weighted channel averaging, this should contain the number of pixel SOM/meta cluster counts of each cell, normalized by cell_size.
cell_remapped_name (str) – Name of the file containing the cell SOM clusters to their remapped meta clusters
weighted_cell_channel_name (str) – The name of the file containing the weighted channel expression table
cell_som_cluster_channel_avg_name (str) – The name of the file to save the average weighted channel expression per cell SOM cluster
cell_meta_cluster_channel_avg_name (str) – Same as above except for cell meta clusters

ark.phenotyping.weighted_channel_comp.generate_wc_avg_files(fovs, channels, base_dir, cell_cc, cell_som_input_data, weighted_cell_channel_name='weighted_cell_channel.feather', cell_som_cluster_channel_avg_name='cell_som_cluster_channel_avg.csv', cell_meta_cluster_channel_avg_name='cell_meta_cluster_channel_avg.csv', overwrite=False)[source]¶

Generate the weighted channel average files per cell SOM and meta clusters.

When running cell clustering with pixel clusters generated from Pixie, the counts of each pixel cluster per cell are computed. These are multiplied by the average expression profile of each pixel cluster to determine the weighted channel average. This computation is averaged by both cell SOM and meta cluster.

Parameters:

fovs (list) – The list of fovs to subset on
channels (list) – The list of channels to subset on
base_dir (str) – The path to the data directory
cell_cc (cluster_helpers.PixieConsensusCluster) – The consensus cluster object containing the SOM to meta mapping
cell_som_input_data (pandas.DataFrame) – The input data used for SOM training. For weighted channel averaging, it should contain the number of pixel SOM/meta cluster counts of each cell, normalized by cell_size.
weighted_cell_channel_name (str) – The name of the file containing the weighted channel expression table
cell_som_cluster_channel_avg_name (str) – The name of the file to save the average weighted channel expression per cell SOM cluster
cell_meta_cluster_channel_avg_name (str) – Same as above except for cell meta clusters
overwrite (bool) – If set, regenerate average weighted channel expression for SOM and meta clusters

ark.phenotyping.weighted_channel_comp.generate_weighted_channel_avg_heatmap(cell_cluster_channel_avg_path, cell_cluster_col, channels, raw_cmap, renamed_cmap, center_val=0, min_val=-3, max_val=3)[source]¶

Generates a z-scored heatmap of the average weighted channel expression per cell cluster

Parameters:

cell_cluster_channel_avg_path (str) – Path to the file containing the average weighted channel expression per cell cluster
cell_cluster_col (str) – The name of the cell cluster col, needs to be either ‘cell_som_cluster’ or ‘cell_meta_cluster_rename’
channels (list) – The list of channels to visualize
raw_cmap (dict) – Maps the raw meta cluster labels to their respective colors, created by generate_meta_cluster_colormap_dict
renamed_cmap (dict) – Maps the renamed meta cluster labels to their respective colors, created by generate_meta_cluster_colormap_dict
center_val (float) – value at which to center the heatmap
min_val (float) – minimum value the heatmap should take
max_val (float) – maximum value the heatmap should take
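
The effect of min_val and max_val on a z-scored heatmap can be sketched as follows. This is an illustration under assumptions, not the plotting code itself: it z-scores each channel across clusters using the population standard deviation, assumes no channel is constant across clusters, and zscore_and_clip is a hypothetical helper.

```python
import numpy as np


def zscore_and_clip(avg_matrix, min_val=-3.0, max_val=3.0):
    """Z-score each channel (column) across clusters, then clip to [min_val, max_val]."""
    mean = avg_matrix.mean(axis=0)
    std = avg_matrix.std(axis=0)  # assumed non-zero for every channel
    z = (avg_matrix - mean) / std
    return np.clip(z, min_val, max_val)


# Rows: cell clusters; columns: channels (made-up average expression values)
avg = np.array([[0.0, 1.0],
                [2.0, 1.5],
                [4.0, 5.0]])
clipped = zscore_and_clip(avg, min_val=-1.0, max_val=1.0)
unclipped = zscore_and_clip(avg)
```

Clipping keeps a few extreme clusters from compressing the color scale for everything else.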