ark.phenotyping

ark.phenotyping.cell_cluster_utils

ark.phenotyping.cell_cluster_utils.add_consensus_labels_cell_table(base_dir, cell_table_path, cell_som_input_data)[source]

Adds the consensus cluster labels to the cell table, then resaves data to {cell_table_path}_cell_labels.csv

Parameters:
  • base_dir (str) – The path to the data directory

  • cell_table_path (str) – Path of the cell table, needs to be created with Segment_Image_Data.ipynb

  • cell_som_input_data (pandas.DataFrame) – The input data used for SOM training

ark.phenotyping.cell_cluster_utils.compute_cell_som_cluster_cols_avg(cell_cluster_data, cell_som_cluster_cols, cell_cluster_col, keep_count=False)[source]

For each cell SOM cluster, compute the average expression of all cell_som_cluster_cols

Parameters:
  • cell_cluster_data (pandas.DataFrame) – The cell data with SOM and/or meta labels, created by cluster_cells or cell_consensus_cluster

  • cell_som_cluster_cols (list) – The list of columns used for SOM training

  • cell_cluster_col (str) – Name of the cell cluster column to group by, should be 'cell_som_cluster' or 'cell_meta_cluster'

  • keep_count (bool) – Whether to include the cell counts or not, should only be set to True for visualization support

Returns:

Contains the average values for each column across cell SOM clusters

Return type:

pandas.DataFrame
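As a rough illustration of the per-cluster averaging described above, the sketch below groups rows by a cluster column and averages each selected column, optionally attaching a count. It uses plain Python dictionaries in place of a pandas.DataFrame to stay dependency-free. The helper name cluster_col_avg and the sample rows are hypothetical and not part of ark.

```python
from collections import defaultdict


def cluster_col_avg(rows, value_cols, cluster_col, keep_count=False):
    """Average value_cols within each cluster_col group (illustrative only)."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[cluster_col]].append(row)

    result = []
    for cluster in sorted(groups):
        members = groups[cluster]
        avg_row = {cluster_col: cluster}
        for col in value_cols:
            avg_row[col] = sum(m[col] for m in members) / len(members)
        if keep_count:
            # The count column is only meant for visualization support
            avg_row["count"] = len(members)
        result.append(avg_row)
    return result


# Hypothetical cell data with SOM labels attached
cells = [
    {"cell_som_cluster": 1, "CD4": 0.2, "CD8": 0.0},
    {"cell_som_cluster": 1, "CD4": 0.4, "CD8": 0.2},
    {"cell_som_cluster": 2, "CD4": 0.0, "CD8": 1.0},
]
avg = cluster_col_avg(cells, ["CD4", "CD8"], "cell_som_cluster", keep_count=True)
```

Passing 'cell_meta_cluster' as the grouping column would work the same way, provided that column is present in the rows.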

ark.phenotyping.cell_cluster_utils.create_c2pc_data(fovs, pixel_data_path, cell_table_path, pixel_cluster_col='pixel_meta_cluster_rename')[source]

Create a matrix with each fov-cell label pair and their SOM pixel/meta cluster counts

Parameters:
  • fovs (list) – The list of fovs to subset on

  • pixel_data_path (str) – Path to directory with the pixel data with SOM and meta labels attached. Created by pixel_consensus_cluster.

  • cell_table_path (str) – Path to the cell table, needs to be created with Segment_Image_Data.ipynb

  • pixel_cluster_col (str) – The name of the pixel cluster column to count per cell. Should be 'pixel_som_cluster' or 'pixel_meta_cluster_rename'

Returns:

Return type:

tuple
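The counting step described above can be pictured as follows: a minimal sketch, assuming each pixel record carries its fov, its cell label, and a pixel cluster column. The function name count_pixel_clusters_per_cell and the sample records are hypothetical. The real function reads its data from pixel_data_path and cell_table_path and may apply further filtering or normalization not shown here.

```python
from collections import Counter, defaultdict


def count_pixel_clusters_per_cell(pixels, pixel_cluster_col):
    """Count pixel cluster occurrences for every (fov, label) pair."""
    counts = defaultdict(Counter)
    for px in pixels:
        counts[(px["fov"], px["label"])][px[pixel_cluster_col]] += 1
    return counts


# Hypothetical pixel records with renamed meta cluster labels attached
pixels = [
    {"fov": "fov0", "label": 1, "pixel_meta_cluster_rename": "tumor"},
    {"fov": "fov0", "label": 1, "pixel_meta_cluster_rename": "tumor"},
    {"fov": "fov0", "label": 1, "pixel_meta_cluster_rename": "immune"},
    {"fov": "fov0", "label": 2, "pixel_meta_cluster_rename": "immune"},
]
c2pc = count_pixel_clusters_per_cell(pixels, "pixel_meta_cluster_rename")
```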

ark.phenotyping.cell_meta_clustering

ark.phenotyping.cell_meta_clustering.apply_cell_meta_cluster_remapping(base_dir, cell_som_input_data, cell_remapped_name)[source]

Apply the meta cluster remapping to the data in cell_consensus_name. Resave the re-mapped consensus data to cell_consensus_name.

Parameters:
  • base_dir (str) – The path to the data directory

  • cell_som_input_data (pandas.DataFrame) – The input data used for SOM training

  • cell_remapped_name (str) – Name of the file containing the cell SOM clusters to their remapped meta clusters

Returns:

The input data used for SOM training with renamed meta labels attached

Return type:

pandas.DataFrame

ark.phenotyping.cell_meta_clustering.cell_consensus_cluster(base_dir, cell_som_cluster_cols, cell_som_input_data, cell_som_expr_col_avg_name, max_k=20, cap=3, seed=42, overwrite=False)[source]

Run consensus clustering algorithm on cell-level data averaged across each cell SOM cluster.

Saves data with consensus cluster labels to cell_consensus_name.

Parameters:
  • base_dir (str) – The path to the data directory

  • cell_som_cluster_cols (list) – The list of columns used for SOM training

  • cell_som_input_data (pandas.DataFrame) – The data used for SOM training with SOM labels attached

  • cell_som_expr_col_avg_name (str) – The name of the file with the average expression per column across cell SOM clusters. Used to run consensus clustering on.

  • max_k (int) – The number of consensus clusters

  • cap (int) – The z-score cap to use for hierarchical clustering

  • seed (int) – The random seed to set for consensus clustering

  • overwrite (bool) – If set, overwrites the meta cluster assignments if they exist

Returns:

  • cluster_helpers.PixieConsensusCluster: the consensus cluster object containing the SOM to meta mapping

  • pandas.DataFrame: the input data used for SOM training with meta labels attached

Return type:

tuple

ark.phenotyping.cell_meta_clustering.generate_meta_avg_files(base_dir, cell_cc, cell_som_cluster_cols, cell_som_input_data, cell_som_expr_col_avg_name, cell_meta_expr_col_avg_name, overwrite=False)[source]

Computes and saves the average cluster column expression across cell meta clusters. Assigns meta cluster labels to the data stored in cell_som_expr_col_avg_name.

Parameters:
  • base_dir (str) – The path to the data directory

  • cell_cc (cluster_helpers.PixieConsensusCluster) – The consensus cluster object containing the SOM to meta mapping

  • cell_som_cluster_cols (list) – The list of columns used for SOM training

  • cell_som_input_data (pandas.DataFrame) – The input data used for SOM training. Will have meta labels appended after this process is run.

  • cell_som_expr_col_avg_name (str) – The average values of cell_som_cluster_cols per cell SOM cluster. Used to run consensus clustering on.

  • cell_meta_expr_col_avg_name (str) – Same as above except for cell meta clusters

  • overwrite (bool) – If set, regenerate the averages of cell_som_cluster_cols per meta cluster

ark.phenotyping.cell_meta_clustering.generate_remap_avg_count_files(base_dir, cell_som_input_data, cell_remapped_name, cell_som_cluster_cols, cell_som_expr_col_avg_name, cell_meta_expr_col_avg_name)[source]

Apply the cell cluster remapping to the average count files

Parameters:
  • base_dir (str) – The path to the data directory

  • cell_som_input_data (pandas.DataFrame) – The input data used for SOM training

  • cell_remapped_name (str) – Name of the file containing the cell SOM clusters to their remapped meta clusters

  • cell_som_cluster_cols (list) – The list of columns used for SOM training

  • cell_som_expr_col_avg_name (str) – The average values of cell_som_cluster_cols per cell SOM cluster

  • cell_meta_expr_col_avg_name (str) – Same as above except for cell meta clusters

ark.phenotyping.cell_som_clustering

ark.phenotyping.cell_som_clustering.cluster_cells(base_dir, cell_pysom, cell_som_cluster_cols, num_parallel_cells=1000000, overwrite=False)[source]

Uses trained SOM weights to assign cluster labels on full cell data.

Saves data with cluster labels to cell_cluster_name.

Parameters:
  • base_dir (str) – The path to the data directory

  • cell_pysom (cluster_helpers.CellSOMCluster) – The SOM cluster object containing the cell SOM weights

  • cell_som_cluster_cols (list) – The list of columns used for SOM training

  • num_parallel_cells (int) – How many cells to label in parallel at once

  • overwrite (bool) – If set, overwrites the SOM cluster assignments if they exist

Returns:

The cell data in cell_pysom.cell_data with SOM labels assigned

Return type:

pandas.DataFrame

ark.phenotyping.cell_som_clustering.generate_som_avg_files(base_dir, cell_som_input_data, cell_som_cluster_cols, cell_som_expr_col_avg_name, overwrite=False)[source]

Computes and saves the average expression of all cell_som_cluster_cols across cell SOM clusters.

Parameters:
  • base_dir (str) – The path to the data directory

  • cell_som_input_data (pandas.DataFrame) – The input data used for SOM training with SOM labels attached

  • cell_som_cluster_cols (list) – The list of columns used for SOM training

  • cell_som_expr_col_avg_name (str) – The name of the file to write the average expression per column across cell SOM clusters

  • overwrite (bool) – If set, regenerate the averages of cell_som_cluster_cols for SOM clusters

ark.phenotyping.cell_som_clustering.train_cell_som(fovs, base_dir, cell_table_path, cell_som_cluster_cols, cell_som_input_data, som_weights_name='cell_som_weights.feather', xdim=10, ydim=10, lr_start=0.05, lr_end=0.01, num_passes=1, seed=42, overwrite=False, normalize=True)[source]

Run the SOM training on the expression columns specified in cell_som_cluster_cols.

Saves the SOM weights to base_dir/som_weights_name.

Parameters:
  • fovs (list) – The list of fovs to subset on

  • base_dir (str) – The path to the data directories

  • cell_table_path (str) – Path of the cell table, needs to be created with Segment_Image_Data.ipynb

  • cell_som_cluster_cols (list) – The list of columns in cell_som_input_data to use for SOM training

  • cell_som_input_data (pandas.DataFrame) – The input data to use for SOM training

  • som_weights_name (str) – The name of the file to save the SOM weights to

  • xdim (int) – The number of x nodes to use for the SOM

  • ydim (int) – The number of y nodes to use for the SOM

  • lr_start (float) – The start learning rate for the SOM, decays to lr_end

  • lr_end (float) – The end learning rate for the SOM, decays from lr_start

  • num_passes (int) – The number of training passes to make through the dataset

  • seed (int) – The random seed to use for training the SOM

  • overwrite (bool) – If set, force retrains the SOM and overwrites the weights

  • normalize (bool) – Whether to perform 99.9th percentile normalization. Defaults to True.

Returns:

The SOM cluster object containing the cell SOM weights

Return type:

cluster_helpers.CellSOMCluster

ark.phenotyping.cluster_helpers

class ark.phenotyping.cluster_helpers.CellSOMCluster(cell_data: pandas.DataFrame, weights_path: Path, fovs: List[str], columns: List[str], num_passes: int = 1, xdim: int = 10, ydim: int = 10, lr_start: float = 0.05, lr_end: float = 0.01, seed=42, normalize=True)[source]

Bases: PixieSOMCluster

__init__(cell_data: pandas.DataFrame, weights_path: Path, fovs: List[str], columns: List[str], num_passes: int = 1, xdim: int = 10, ydim: int = 10, lr_start: float = 0.05, lr_end: float = 0.01, seed=42, normalize=True)[source]

Creates a cell SOM cluster object derived from the abstract PixieSOMCluster

Parameters:
  • cell_data (pandas.DataFrame) – The dataset to use for training

  • weights_path (pathlib.Path) – The path to save the weights to.

  • fovs (List[str]) – The list of FOVs to subset the data on.

  • columns (List[str]) – The list of columns to subset the data on.

  • num_passes (int) – The number of SOM training passes to use.

  • xdim (int) – The number of SOM nodes on the x-axis.

  • ydim (int) – The number of SOM nodes on the y-axis.

  • lr_start (float) – The initial learning rate.

  • lr_end (float) – The learning rate to decay to.

  • seed (int) – The random seed to use.

  • normalize (bool) – Whether to perform 99.9th percentile normalization. Defaults to True.

assign_som_clusters(num_parallel_cells=1000000) → pandas.DataFrame[source]

Assigns SOM clusters using weights to cell_data

Parameters:
  • num_parallel_cells (int) – Partition size of self.cell_data for assigning SOM labels

Returns:

cell_data with the SOM clusters assigned.

Return type:

pandas.DataFrame

normalize_data()[source]

Normalizes cell_data by the 99.9th percentile value of each pixel cluster count column

Returns:

cell_data with columns normalized by the values in norm_data

Return type:

pandas.DataFrame
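To make the percentile normalization concrete, here is a minimal sketch in plain Python. It assumes a linearly interpolated percentile and leaves a column untouched when its percentile is zero. Both choices, and the names percentile and normalize_columns, are illustrative assumptions rather than ark's actual implementation.

```python
def percentile(values, q):
    """Linearly interpolated percentile, with q in [0, 100]."""
    ordered = sorted(values)
    if len(ordered) == 1:
        return ordered[0]
    pos = (len(ordered) - 1) * q / 100.0
    lower = int(pos)
    upper = min(lower + 1, len(ordered) - 1)
    frac = pos - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * frac


def normalize_columns(rows, columns, q=99.9):
    """Divide each column by its q-th percentile value."""
    norm_vals = {col: percentile([r[col] for r in rows], q) for col in columns}
    normalized = []
    for r in rows:
        new_row = dict(r)
        for col in columns:
            # Columns whose percentile is 0 are left unchanged
            # to avoid dividing by zero (an assumption of this sketch)
            if norm_vals[col] != 0:
                new_row[col] = r[col] / norm_vals[col]
        normalized.append(new_row)
    return normalized, norm_vals


# Hypothetical per-cell pixel cluster counts
rows = [{"cluster_1": 0.0, "cluster_2": 3.0}, {"cluster_1": 10.0, "cluster_2": 6.0}]
normalized, norm_vals = normalize_columns(rows, ["cluster_1", "cluster_2"])
```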

train_som(overwrite=False)[source]

Trains the SOM using cell_data

Parameters:

overwrite (bool) – If set, force retrains the SOM and overwrites the weights

class ark.phenotyping.cluster_helpers.ClusterClassTemplate(*args, **kwargs)[source]

Bases: Protocol

fit_predict() → None[source]

property n_clusters: int

class ark.phenotyping.cluster_helpers.ConsensusCluster(cluster: ClusterClassTemplate, L: int, K: int, H: int, resample_proportion: float = 0.5)[source]

Bases: object

__init__(cluster: ClusterClassTemplate, L: int, K: int, H: int, resample_proportion: float = 0.5)[source]

Implementation of Consensus clustering, following the paper https://link.springer.com/content/pdf/10.1023%2FA%3A1023949509487.pdf

Parameters:
  • cluster (Callable) –

    Clustering class.

    NOTE: the class is to be instantiated with parameter n_clusters, and possess a fit_predict method, which is invoked on data.

  • L (int) – Smallest number of clusters to try.

  • K (int) – Biggest number of clusters to try.

  • H (int) – Number of resamplings for each cluster number.

  • resample_proportion (float) – Proportion of the data to sample in each resampling.

  • Mk (numpy.ndarray) – Consensus matrices for each k (shape = (K, data.shape[0], data.shape[0])). NOTE: every consensus matrix is retained, as specified in the paper.

  • Ak (numpy.ndarray) – Area under CDF for each number of clusters. See paper: section 3.3.1. Consensus distribution.

  • deltaK (numpy.ndarray) – Changes in areas under CDF. See paper: section 3.3.1. Consensus distribution.

  • self.bestK (int) – Number of clusters that was found to be best.

fit(data: pandas.DataFrame, verbose: bool = False)[source]

Fits a consensus matrix for each number of clusters

Parameters:
  • data (pd.DataFrame) – The data in (examples,attributes) format.

  • verbose (bool) – Should print or not.
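For intuition, the sketch below builds a single consensus matrix for one fixed number of clusters: each entry is the fraction of resamplings in which two items were sampled together and landed in the same cluster. It is a simplified illustration, not the class's code. The cluster_fn callable, the threshold clusterer, and the toy values are all hypothetical, and the real fit repeats this for every cluster number from L to K with H resamplings each.

```python
import random


def consensus_matrix(n_items, cluster_fn, n_resamples,
                     resample_proportion=0.5, seed=42):
    """Fraction of resamplings in which each pair of items shares a cluster.

    cluster_fn is a hypothetical callable: given a list of sampled item
    indices, it returns one cluster label per sampled item.
    """
    rng = random.Random(seed)
    together = [[0] * n_items for _ in range(n_items)]
    sampled = [[0] * n_items for _ in range(n_items)]
    n_sample = max(1, int(n_items * resample_proportion))

    for _ in range(n_resamples):
        idx = rng.sample(range(n_items), n_sample)
        labels = cluster_fn(idx)
        for a, i in enumerate(idx):
            for b, j in enumerate(idx):
                sampled[i][j] += 1
                if labels[a] == labels[b]:
                    together[i][j] += 1

    # Pairs that were never sampled together get a consensus of 0.0
    return [
        [together[i][j] / sampled[i][j] if sampled[i][j] else 0.0
         for j in range(n_items)]
        for i in range(n_items)
    ]


# Toy 1-D data with two obvious groups, clustered by a fixed threshold
values = [0.0, 0.1, 0.2, 5.0, 5.1, 5.2]


def threshold_cluster(idx):
    return [int(values[i] > 2.5) for i in idx]


M = consensus_matrix(len(values), threshold_cluster, n_resamples=50)
```

Because the toy clusterer is perfectly stable, every entry of M is either 0.0 or 1.0. An unstable clustering would produce intermediate values, which is what the area under the CDF (Ak) and its changes (deltaK) are meant to summarize.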

predict()[source]

Predicts on the consensus matrix, for best found cluster number.

Returns:

The consensus matrix prediction for self.bestK.

Return type:

numpy.ndarray

predict_data(data: pandas.DataFrame)[source]

Predicts on the data, for best found cluster number

Parameters:

data (pandas.DataFrame) – (examples,attributes) format

Returns:

The data matrix prediction for self.bestK.

Return type:

pandas.DataFrame

class ark.phenotyping.cluster_helpers.PixelSOMCluster(pixel_subset_folder: Path, norm_vals_path: Path, weights_path: Path, fovs: List[str], columns: List[str], num_passes: int = 1, xdim: int = 10, ydim: int = 10, lr_start: float = 0.05, lr_end: float = 0.01, seed=42)[source]

Bases: PixieSOMCluster

__init__(pixel_subset_folder: Path, norm_vals_path: Path, weights_path: Path, fovs: List[str], columns: List[str], num_passes: int = 1, xdim: int = 10, ydim: int = 10, lr_start: float = 0.05, lr_end: float = 0.01, seed=42)[source]

Creates a pixel SOM cluster object derived from the abstract PixieSOMCluster

Parameters:
  • pixel_subset_folder (pathlib.Path) – The name of the subsetted pixel data directory

  • norm_vals_path (pathlib.Path) – The name of the feather file containing the normalization values.

  • weights_path (pathlib.Path) – The path to save the weights to.

  • fovs (List[str]) – The list of FOVs to subset the data on.

  • columns (List[str]) – The list of columns to subset the data on.

  • num_passes (int) – The number of SOM training passes to use.

  • xdim (int) – The number of SOM nodes on the x-axis.

  • ydim (int) – The number of SOM nodes on the y-axis.

  • lr_start (float) – The initial learning rate.

  • lr_end (float) – The learning rate to decay to.

  • seed (int) – The random seed to use.

assign_som_clusters(external_data: pandas.DataFrame, normalize_data: bool = True, num_parallel_pixels: int = 1000000) → pandas.DataFrame[source]

Assigns SOM clusters using weights to a dataset

Parameters:
  • external_data (pandas.DataFrame) – The dataset to assign SOM clusters to

  • normalize_data (bool) – Whether or not to normalize external_data. Flag needed to prevent re-normalization of normalized dataset.

  • num_parallel_pixels (int) – Partition size of external_data for assigning SOM labels

Returns:

The dataset with the SOM clusters assigned.

Return type:

pandas.DataFrame

normalize_data(external_data: pandas.DataFrame) → pandas.DataFrame[source]

Uses norm_data to normalize a dataset

Parameters:

external_data (pandas.DataFrame) – The data to normalize

Returns:

The data with columns normalized by the values in norm_data

Return type:

pandas.DataFrame

train_som(overwrite=False)[source]

Trains the SOM using train_data

Parameters:

overwrite (bool) – If set, force retrains the SOM and overwrites the weights

class ark.phenotyping.cluster_helpers.PixieConsensusCluster(cluster_type: str, input_file: Path, columns: List[str], max_k: int = 20, cap: float = 3)[source]

Bases: object

__init__(cluster_type: str, input_file: Path, columns: List[str], max_k: int = 20, cap: float = 3)[source]

Constructs a generic ConsensusCluster pipeline object that makes use of Sagovic’s implementation of consensus clustering in Python.

Parameters:
  • cluster_type (str) – The type of data being run through consensus clustering. Must be either 'pixel' or 'cell'

  • input_file (pathlib.Path) – The average expression values per SOM cluster .csv, computed by ark.phenotyping.cluster_pixels or ark.phenotyping.cluster_cells depending on the type of data being generated.

  • columns (List[str]) – The list of columns to subset the data in input_file on for consensus clustering.

  • max_k (int) – The number of consensus clusters to use.

  • cap (float) – The value to cap the data in input_file at after z-score normalization. Data will be within the range [-cap, cap].

assign_consensus_labels(external_data: pandas.DataFrame) → pandas.DataFrame[source]

Takes an external dataset and applies ConsensusCluster mapping to it.

Parameters:

external_data (pandas.DataFrame) – A dataset which contains a '{self.cluster_type}_som_cluster' column.

Returns:

The external_data with a '{self.cluster_type}_meta_cluster' column attached.

Return type:

pandas.DataFrame
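Conceptually, attaching consensus labels is a lookup from SOM cluster to meta cluster. The following minimal sketch, which uses plain dictionaries in place of a DataFrame, shows the idea. The function name assign_meta_labels and the example mapping are hypothetical.

```python
def assign_meta_labels(rows, som_to_meta, cluster_type="cell"):
    """Attach a '{cluster_type}_meta_cluster' value to each row via lookup."""
    som_col = f"{cluster_type}_som_cluster"
    meta_col = f"{cluster_type}_meta_cluster"
    return [{**row, meta_col: som_to_meta[row[som_col]]} for row in rows]


# Hypothetical SOM-to-meta mapping and labeled cells
mapping = {1: 10, 2: 10, 3: 20}
cells = [{"cell_som_cluster": 1}, {"cell_som_cluster": 3}]
labeled = assign_meta_labels(cells, mapping, cluster_type="cell")
```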

generate_som_to_meta_map()[source]

Maps each '{self.cluster_type}_som_cluster' to the meta cluster generated by ConsensusCluster.

Also assigns mapping to self.mapping for use in assign_consensus_labels.

run_consensus_clustering()[source]

Fits the meta clustering results using ConsensusCluster.

save_som_to_meta_map(save_path: Path)[source]

Saves the mapping generated by ConsensusCluster to save_path.

Parameters:

save_path (pathlib.Path) – The path to save self.mapping to.

scale_data()[source]

z-scores and caps input_data.

Scaling will be done on a per-column basis for all column names specified. Capping will truncate the data to the range [-cap, cap].
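The z-score and cap step can be sketched per column as below. This is an illustration only: it assumes the population standard deviation and maps a constant column to zeros, neither of which is confirmed by the text above. The name zscore_and_cap is hypothetical.

```python
import statistics


def zscore_and_cap(values, cap=3.0):
    """z-score one column, then clip the result to [-cap, cap]."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population std: an assumption
    if std == 0:
        # A constant column carries no signal; return zeros
        return [0.0 for _ in values]
    return [max(-cap, min(cap, (v - mean) / std)) for v in values]


# Hypothetical average expression per SOM cluster, one list per column
data = {"CD4": [0.0, 0.1, 0.2, 9.0], "CD8": [1.0, 1.0, 1.0, 1.0]}
scaled = {col: zscore_and_cap(vals, cap=3.0) for col, vals in data.items()}
```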

class ark.phenotyping.cluster_helpers.PixieSOMCluster(weights_path: Path, columns: List[str], num_passes: int = 1, xdim: int = 10, ydim: int = 10, lr_start: float = 0.05, lr_end: float = 0.01, seed=42)[source]

Bases: ABC

abstract __init__(weights_path: Path, columns: List[str], num_passes: int = 1, xdim: int = 10, ydim: int = 10, lr_start: float = 0.05, lr_end: float = 0.01, seed=42)[source]

Generic implementation of a pyFlowSOM runner

Parameters:
  • weights_path (pathlib.Path) – The path to save the weights to.

  • columns (List[str]) – The list of columns to subset the data on.

  • num_passes (int) – The number of SOM training passes to use.

  • xdim (int) – The number of SOM nodes on the x-axis.

  • ydim (int) – The number of SOM nodes on the y-axis.

  • lr_start (float) – The initial learning rate.

  • lr_end (float) – The learning rate to decay to.

  • seed (int) – The random seed to use for training.

generate_som_clusters(external_data: pandas.DataFrame, num_parallel_obs: int = 1000000) → numpy.ndarray[source]

Uses the weights to generate SOM clusters for a dataset

Parameters:
  • external_data (pandas.DataFrame) – The dataset to generate SOM clusters for

  • num_parallel_obs (int) – Partition size of external_data for assigning SOM labels

Returns:

The SOM clusters generated for each pixel in external_data

Return type:

numpy.ndarray
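Assigning SOM clusters from trained weights amounts to finding the closest node for each observation. Below is a minimal sketch under stated assumptions: squared Euclidean distance, 1-based cluster labels, and partitions processed one after another rather than in parallel. The name nearest_node_labels is hypothetical, and pyFlowSOM's actual routine may differ.

```python
def nearest_node_labels(data, weights, num_parallel_obs=1000000):
    """Label each observation with the 1-based index of its closest SOM node."""
    labels = []
    # Walk through the data in partitions of num_parallel_obs rows
    for start in range(0, len(data), num_parallel_obs):
        for obs in data[start:start + num_parallel_obs]:
            dists = [
                sum((o - w) ** 2 for o, w in zip(obs, node))
                for node in weights
            ]
            labels.append(dists.index(min(dists)) + 1)
    return labels


# Hypothetical two-node SOM and three observations
weights = [[0.0, 0.0], [1.0, 1.0]]
data = [[0.1, 0.0], [0.9, 1.2], [0.4, 0.4]]
labels = nearest_node_labels(data, weights, num_parallel_obs=2)
```

The partition size only affects how much data is handled at once. The labels themselves do not depend on it.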

abstract normalize_data() → pandas.DataFrame[source]

Generic implementation of the normalization process to use on the input data

Returns:

The data with columns normalized by the values in norm_data

Return type:

pandas.DataFrame

train_som(data: pandas.DataFrame)[source]

Trains the SOM on the data provided and saves the weights generated

Parameters:

data (pandas.DataFrame) – The input data to train the SOM on.

ark.phenotyping.cluster_helpers.verify_unique_meta_clusters(pixie_remapped_data: pandas.DataFrame, meta_cluster_type: Literal['pixel', 'cell'])[source]

Verifies that a mapping contains a unique renamed meta cluster for every base meta cluster

Parameters:
  • pixie_remapped_data (pandas.DataFrame) – Must have {pixel/cell}_meta_cluster and {pixel/cell}_meta_cluster_rename columns

  • meta_cluster_type (Literal['pixel', 'cell']) – Whether pixel or cell meta clusters are being validated

Raises:

ValueError – If there are duplicate {pixel/cell}_meta_cluster_rename entries for multiple {pixel/cell}_meta_cluster values
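The check described above can be sketched as follows, with a list of dictionaries standing in for the DataFrame. The function name verify_unique_renames and the error message are hypothetical. Only the raise condition, one renamed label shared by several base meta clusters, is taken from the description.

```python
def verify_unique_renames(rows, meta_cluster_type):
    """Raise ValueError if one renamed label maps to several meta clusters."""
    meta_col = f"{meta_cluster_type}_meta_cluster"
    rename_col = f"{meta_cluster_type}_meta_cluster_rename"
    rename_to_meta = {}
    for row in rows:
        # Remember the first meta cluster seen for each renamed label
        first_meta = rename_to_meta.setdefault(row[rename_col], row[meta_col])
        if first_meta != row[meta_col]:
            raise ValueError(
                f"Duplicate {rename_col} '{row[rename_col]}' used for "
                f"{meta_col} values {first_meta} and {row[meta_col]}"
            )


# Hypothetical remapping in which two SOM clusters share meta cluster 1
good = [
    {"cell_meta_cluster": 1, "cell_meta_cluster_rename": "T cell"},
    {"cell_meta_cluster": 1, "cell_meta_cluster_rename": "T cell"},
    {"cell_meta_cluster": 2, "cell_meta_cluster_rename": "B cell"},
]
verify_unique_renames(good, "cell")  # passes silently
```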

ark.phenotyping.pixel_cluster_utils

ark.phenotyping.pixel_cluster_utils.calculate_channel_percentiles(tiff_dir, fovs, channels, img_sub_folder, percentile)[source]

Calculates average percentile for each channel in the dataset

Parameters:
  • tiff_dir (str) – Name of the directory containing the tiff files

  • fovs (list) – List of fovs to include

  • channels (list) – List of channels to include

  • img_sub_folder (str) – Sub folder within each FOV containing image data

  • percentile (float) – The specific percentile to compute

Returns:

The mapping between each channel and its normalization value

Return type:

pd.DataFrame

ark.phenotyping.pixel_cluster_utils.calculate_pixel_intensity_percentile(tiff_dir, fovs, channels, img_sub_folder, channel_percentiles, percentile=0.05)[source]

Calculates average percentile per FOV for total signal in each pixel

Parameters:
  • tiff_dir (str) – Name of the directory containing the tiff files

  • fovs (list) – List of fovs to include

  • channels (list) – List of channels to include

  • img_sub_folder (str) – Sub folder within each FOV containing image data

  • channel_percentiles (pd.DataFrame) – The mapping between each channel and its normalization value. Computed by calculate_channel_percentiles

  • percentile (float) – The pixel intensity percentile per FOV to average over

Returns:

The average percentile per FOV for total signal in each pixel

Return type:

float

ark.phenotyping.pixel_cluster_utils.check_for_modified_channels(tiff_dir, test_fov, img_sub_folder, channels)[source]

Checks to make sure the user selected newly modified channels

Parameters:
  • tiff_dir (str) – Name of the directory containing the tiff files

  • test_fov (str) – example fov used to check channel names

  • img_sub_folder (str) – sub-folder within each FOV containing image data

  • channels (list) – list of channels to use for analysis

ark.phenotyping.pixel_cluster_utils.compute_pixel_cluster_channel_avg(fovs, channels, base_dir, pixel_cluster_col, num_pixel_clusters, pixel_data_dir='pixel_mat_data', num_fovs_subset=100, seed=42, keep_count=False)[source]

Compute the average channel values across each pixel SOM cluster.

To improve performance, the FOVs are randomly downsampled to at most num_fovs_subset

Parameters:
  • fovs (list) – The list of fovs to subset on

  • channels (list) – The list of channels to subset on

  • base_dir (str) – The path to the data directories

  • pixel_cluster_col (str) – Name of the column to group by

  • num_pixel_clusters (int) – The number of pixel clusters that are desired, if None then no fixed amount required

  • pixel_data_dir (str) – Name of the directory containing the pixel data with cluster labels

  • num_fovs_subset (float) – The number of FOVs to subset on. Note that if len(fovs) < num_fovs_subset, all of the FOVs will still be selected

  • seed (int) – The random seed to use for subsetting FOVs

  • keep_count (bool) – Whether to keep the count column when aggregating or not. This should only be set to True for visualization purposes

Returns:

Contains the average channel values for each pixel SOM/meta cluster

Return type:

pandas.DataFrame
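The FOV subsetting behavior noted for num_fovs_subset can be illustrated with a short sketch: all FOVs are kept when there are not enough to subsample, otherwise a seeded random subset is drawn. The helper name subset_fovs is hypothetical, and the sampling routine ark actually uses is not shown in this reference.

```python
import random


def subset_fovs(fovs, num_fovs_subset=100, seed=42):
    """Return all FOVs if there are too few, else a seeded random subset."""
    if len(fovs) <= num_fovs_subset:
        return list(fovs)
    rng = random.Random(seed)  # a local generator keeps the draw reproducible
    return rng.sample(fovs, num_fovs_subset)


# Hypothetical FOV names
all_fovs = [f"fov{i}" for i in range(250)]
chosen = subset_fovs(all_fovs, num_fovs_subset=100, seed=42)
```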

ark.phenotyping.pixel_cluster_utils.filter_with_nuclear_mask(fovs: List, tiff_dir: str, seg_dir: str, channel: str, nuc_seg_suffix: str = '_nuclear.tiff', img_sub_folder: str = None, exclude: bool = True)[source]

Filters out background staining using subcellular marker localization.

Non-nuclear signal is removed from nuclear markers and vice-versa for membrane markers.

Parameters:
  • fovs (list) – The list of fovs to filter

  • tiff_dir (str) – Name of the directory containing the tiff files

  • seg_dir (str) – Name of the directory containing the segmented files

  • channel (str) – Channel to apply filtering to

  • nuc_seg_suffix (str) – The suffix for the nuclear channel. (i.e. for “fov1”, a suffix of “_nuclear.tiff” would make a file named “fov1_nuclear.tiff”)

  • img_sub_folder (str) – Name of the subdirectory inside tiff_dir containing the tiff files. Set to None if there isn’t any.

  • exclude (bool) – Whether to filter out nuclear or membrane signal

ark.phenotyping.pixel_cluster_utils.find_fovs_missing_col(base_dir, data_dir, missing_col)[source]

Identify FOV names in data_dir without missing_col

Parameters:
  • base_dir (str) – The path to the data directories

  • data_dir (str) – Name of the directory which contains the full preprocessed pixel data

  • missing_col (str) – Name of the column to identify

Returns:

List of FOVs without missing_col

Return type:

list

ark.phenotyping.pixel_cluster_utils.normalize_rows(pixel_data, channels, include_seg_label=True)[source]

Normalizes the rows of a pixel matrix by their sum

Parameters:
  • pixel_data (pandas.DataFrame) – The dataframe containing the pixel data for a given fov. Includes channel and meta (fov, label, etc.) columns

  • channels (list) – List of channels to subset over

  • include_seg_label (bool) – Whether to include 'label' as a metadata column

Returns:

The pixel data with rows normalized and 0-sum rows removed

Return type:

pandas.DataFrame
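Row normalization as described above divides each pixel's channel values by their sum and drops pixels whose channels sum to zero. Here is a minimal sketch using dictionaries in place of a DataFrame. The name normalize_rows_sketch and the sample pixels are hypothetical, and metadata handling such as include_seg_label is not modeled.

```python
def normalize_rows_sketch(rows, channels):
    """Divide channel values by their row sum and drop zero-sum rows."""
    normalized = []
    for row in rows:
        total = sum(row[ch] for ch in channels)
        if total == 0:
            continue  # zero-sum rows are removed
        new_row = dict(row)  # metadata columns are carried over unchanged
        for ch in channels:
            new_row[ch] = row[ch] / total
        normalized.append(new_row)
    return normalized


# Hypothetical pixel data for one FOV
pixel_rows = [
    {"fov": "fov0", "label": 1, "CD4": 1.0, "CD8": 3.0},
    {"fov": "fov0", "label": 2, "CD4": 0.0, "CD8": 0.0},
]
result = normalize_rows_sketch(pixel_rows, ["CD4", "CD8"])
```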

ark.phenotyping.pixel_cluster_utils.smooth_channels(fovs, tiff_dir, img_sub_folder, channels, smooth_vals)[source]

Adds additional smoothing for selected channels as a preprocessing step

Parameters:
  • fovs (list) – List of fovs to process

  • tiff_dir (str) – Name of the directory containing the tiff files

  • img_sub_folder (str) – sub-folder within each FOV containing image data

  • channels (list) – list of channels to apply smoothing to

  • smooth_vals (list or int) – amount to smooth channels. If a single int, applies to all channels. Otherwise, a custom value per channel can be supplied
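The two accepted forms of smooth_vals can be reconciled by expanding a single int into one value per channel. The sketch below shows one way to do that. The helper name expand_smooth_vals and the length check are assumptions for illustration, not ark's documented behavior.

```python
def expand_smooth_vals(channels, smooth_vals):
    """Map each channel to its smoothing value."""
    if isinstance(smooth_vals, int):
        # A single int applies to every channel
        return {ch: smooth_vals for ch in channels}
    if len(smooth_vals) != len(channels):
        # Assumed validation: one custom value per channel
        raise ValueError("smooth_vals must have one entry per channel")
    return dict(zip(channels, smooth_vals))


# Hypothetical channel names
uniform = expand_smooth_vals(["CD4", "CD8"], 2)
custom = expand_smooth_vals(["CD4", "CD8"], [1, 3])
```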

ark.phenotyping.pixel_meta_clustering

ark.phenotyping.pixel_meta_clustering.apply_pixel_meta_cluster_remapping(fovs, channels, base_dir, pixel_data_dir, pixel_remapped_name, multiprocess=False, batch_size=5)[source]

Apply the meta cluster remapping to the data in pixel_data_dir.

Parameters:
  • fovs (list) – The list of fovs to subset on

  • channels (list) – The list of channels to subset on

  • base_dir (str) – The path to the data directories

  • pixel_data_dir (str) – Name of directory with the full pixel data. This data should also have the SOM cluster labels appended from cluster_pixels and the meta cluster labels appended from pixel_consensus_cluster.

  • pixel_remapped_name (str) – Name of the file containing the pixel SOM clusters to their remapped meta clusters

  • multiprocess (bool) – Whether to use multiprocessing or not

  • batch_size (int) – The number of FOVs to process in parallel

ark.phenotyping.pixel_meta_clustering.generate_meta_avg_files(fovs, channels, base_dir, pixel_cc, data_dir='pixel_mat_data', pc_chan_avg_som_cluster_name='pixel_channel_avg_som_cluster.csv', pc_chan_avg_meta_cluster_name='pixel_channel_avg_meta_cluster.csv', num_fovs_subset=100, seed=42, overwrite=False)[source]

Computes and saves the average channel expression across pixel meta clusters. Assigns meta cluster labels to the data stored in pc_chan_avg_som_cluster_name.

Parameters:
  • fovs (list) – The list of fovs to subset on

  • channels (list) – The list of channels to subset on

  • base_dir (str) – The path to the data directory

  • pixel_cc (cluster_helpers.PixieConsensusCluster) – The consensus cluster object containing the SOM to meta mapping

  • data_dir (str) – Name of the directory which contains the full preprocessed pixel data. This data should also have the SOM cluster labels appended from cluster_pixels.

  • pc_chan_avg_som_cluster_name (str) – Name of file to save the channel-averaged results across all SOM clusters to

  • pc_chan_avg_meta_cluster_name (str) – Name of file to save the channel-averaged results across all meta clusters to

  • num_fovs_subset (float) – The number of FOVs to subset on for meta cluster channel averaging

  • seed (int) – The random seed to use for subsetting FOVs

  • overwrite (bool) – If set, force overwrites the existing average channel expression file if it exists

ark.phenotyping.pixel_meta_clustering.generate_remap_avg_files(fovs, channels, base_dir, pixel_data_dir, pixel_remapped_name, pc_chan_avg_som_cluster_name, pc_chan_avg_meta_cluster_name, num_fovs_subset=100, seed=42)[source]

Resaves the re-mapped consensus data to pixel_data_dir and re-runs the average channel expression per pixel meta cluster computation.

Re-maps the pixel SOM clusters to meta clusters in pc_chan_avg_som_cluster_name.

Parameters:
  • fovs (list) – The list of fovs to subset on

  • channels (list) – The list of channels to subset on

  • base_dir (str) – The path to the data directories

  • pixel_data_dir (str) – Name of directory with the full pixel data. This data should also have the SOM cluster labels appended from cluster_pixels and the meta cluster labels appended from pixel_consensus_cluster.

  • pixel_remapped_name (str) – Name of the file containing the pixel SOM clusters to their remapped meta clusters

  • pc_chan_avg_som_cluster_name (str) – Name of the file containing the channel-averaged results across all SOM clusters

  • pc_chan_avg_meta_cluster_name (str) – Name of the file containing the channel-averaged results across all meta clusters

  • num_fovs_subset (float) – The number of FOVs to subset on for meta cluster channel averaging

  • seed (int) – The random seed to use for subsetting FOVs

ark.phenotyping.pixel_meta_clustering.pixel_consensus_cluster(fovs, channels, base_dir, max_k=20, cap=3, data_dir='pixel_mat_data', pc_chan_avg_som_cluster_name='pixel_channel_avg_som_cluster.csv', multiprocess=False, batch_size=5, seed=42, overwrite=False)[source]

Run consensus clustering algorithm on pixel-level summed data across channels. Saves data with consensus cluster labels to data_dir.

Parameters:
  • fovs (list) – The list of fovs to subset on

  • channels (list) – The list of channels to subset on

  • base_dir (str) – The path to the data directory

  • max_k (int) – The number of consensus clusters

  • cap (int) – The z-score cap to use for hierarchical clustering

  • data_dir (str) – Name of the directory which contains the full preprocessed pixel data. This data should also have the SOM cluster labels appended from cluster_pixels.

  • pc_chan_avg_som_cluster_name (str) – Name of file to save the channel-averaged results across all SOM clusters to

  • multiprocess (bool) – Whether to use multiprocessing or not

  • batch_size (int) – The number of FOVs to process in parallel, ignored if multiprocess is False

  • seed (int) – The random seed to set for consensus clustering

  • overwrite (bool) – If set, force overwrites the meta labels in all the FOVs

Returns:

The consensus cluster object containing the SOM to meta mapping

Return type:

cluster_helpers.PixieConsensusCluster

ark.phenotyping.pixel_meta_clustering.run_pixel_consensus_assignment(pixel_data_path, pixel_cc_obj, fov)[source]

Helper function to assign pixel consensus clusters

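Parameters:
  • pixel_data_path (str) – The path to the pixel data directory

  • pixel_cc_obj (cluster_helpers.PixieConsensusCluster) – The pixel consensus cluster object containing the SOM to meta mapping

  • fov (str) – The name of the FOV to process

Returns: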

The name of the FOV as well as the return code

Return type:

tuple (str, int)

ark.phenotyping.pixel_meta_clustering.update_pixel_meta_labels(pixel_data_path, pixel_remapped_dict, pixel_renamed_meta_dict, fov)[source]

Helper function to reassign meta cluster names based on remapping scheme to a FOV

Parameters:
  • pixel_data_path (str) – The path to the pixel data directory

  • pixel_remapped_dict (dict) – The mapping from pixel SOM cluster to pixel meta cluster label (not renamed)

  • pixel_renamed_meta_dict (dict) – The mapping from pixel meta cluster label to renamed pixel meta cluster name

  • fov (str) – The name of the FOV to process

Returns:

The name of the FOV as well as the return code

Return type:

tuple (str, int)
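
The two dictionaries are applied in sequence: SOM cluster to meta cluster label, then meta cluster label to renamed meta cluster. A minimal sketch with pandas follows; the toy table, the cluster names, and the intermediate column name pixel_meta_cluster are assumptions for illustration, and the file reading and writing that the helper performs per FOV is omitted.

```python
import pandas as pd

# Hypothetical per-FOV pixel table with SOM labels already assigned.
fov_data = pd.DataFrame({"pixel_som_cluster": [1, 2, 3, 2]})

# SOM cluster -> meta cluster label (not renamed).
pixel_remapped_dict = {1: 1, 2: 1, 3: 2}
# Meta cluster label -> renamed meta cluster name.
pixel_renamed_meta_dict = {1: "tumor", 2: "immune"}

# Apply the two mappings one after the other.
fov_data["pixel_meta_cluster"] = fov_data["pixel_som_cluster"].map(pixel_remapped_dict)
fov_data["pixel_meta_cluster_rename"] = fov_data["pixel_meta_cluster"].map(
    pixel_renamed_meta_dict
)
```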

ark.phenotyping.pixel_som_clustering

ark.phenotyping.pixel_som_clustering.cluster_pixels(fovs, base_dir, pixel_pysom, data_dir='pixel_mat_data', multiprocess=False, batch_size=5, num_parallel_pixels=1000000, overwrite=False)[source]

Uses trained SOM weights to assign cluster labels on full pixel data.

Saves data with cluster labels to data_dir.

Parameters:
  • fovs (list) – The list of fovs to subset on

  • base_dir (str) – The path to the data directory

  • pixel_pysom (cluster_helpers.PixelSOMCluster) – The SOM cluster object containing the pixel SOM weights

  • data_dir (str) – Name of the directory which contains the full preprocessed pixel data

  • multiprocess (bool) – Whether to use multiprocessing or not

  • batch_size (int) – The number of FOVs to process in parallel, ignored if multiprocess is False

  • num_parallel_pixels (int) – How many pixels to label in parallel at once for each FOV

  • overwrite (bool) – If set, force overwrite the SOM labels in all the FOVs
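
When multiprocess is True, FOVs are handled batch_size at a time. A minimal sketch of how a FOV list could be split into such batches, assuming simple consecutive batching; the helper batch_fovs is hypothetical and ark's own scheduling may differ.

```python
def batch_fovs(fovs, batch_size):
    """Split the FOV list into consecutive batches of at most batch_size."""
    return [fovs[i:i + batch_size] for i in range(0, len(fovs), batch_size)]

# Toy FOV names; with batch_size=5, twelve FOVs give batches of 5, 5 and 2.
fovs = [f"fov{i}" for i in range(12)]
batches = batch_fovs(fovs, batch_size=5)
```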

ark.phenotyping.pixel_som_clustering.generate_som_avg_files(fovs, channels, base_dir, pixel_pysom, data_dir='pixel_data_dir', pc_chan_avg_som_cluster_name='pixel_channel_avg_som_cluster.csv', num_fovs_subset=100, require_all_som_clusters=True, seed=42, overwrite=False)[source]

Computes and saves the average channel expression across pixel SOM clusters.

Parameters:
  • fovs (list) – The list of fovs to subset on

  • channels (list) – The list of channels to subset on

  • base_dir (str) – The path to the data directory

  • pixel_pysom (cluster_helpers.PixelSOMCluster) – The SOM cluster object containing the pixel SOM weights

  • data_dir (str) – Name of the directory which contains the full preprocessed pixel data

  • pc_chan_avg_som_cluster_name (str) – The name of the file to save the average channel expression across all SOM clusters

  • num_fovs_subset (int) – The number of FOVs to subset on for SOM cluster channel averaging

  • require_all_som_clusters (bool) – Whether to require all SOM clusters to have at least one pixel assigned

  • seed (int) – The random seed to set for subsetting FOVs

  • overwrite (bool) – If set, force overwrite the existing average channel expression file if it exists
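
Averaging channel expression across SOM clusters amounts to a group-by mean over the cluster label column. A minimal sketch with pandas, using a toy table and hypothetical channel names; the check for missing clusters only illustrates what require_all_som_clusters guards against and is not ark's implementation.

```python
import pandas as pd

# Toy pixel data with SOM labels; channel names are placeholders.
pixel_data = pd.DataFrame({
    "pixel_som_cluster": [1, 1, 2, 2],
    "CD4": [0.25, 0.75, 0.0, 0.5],
    "CD8": [0.0, 0.5, 1.0, 1.0],
})
channels = ["CD4", "CD8"]

# Mean expression of each channel within each SOM cluster.
som_avgs = pixel_data.groupby("pixel_som_cluster")[channels].mean().reset_index()

# Assumption: the SOM has three nodes, so cluster 3 received no pixels here.
expected_clusters = {1, 2, 3}
missing = expected_clusters - set(som_avgs["pixel_som_cluster"])
```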

ark.phenotyping.pixel_som_clustering.run_pixel_som_assignment(pixel_data_path, pixel_pysom_obj, overwrite, num_parallel_pixels, fov)[source]

Helper function to assign pixel SOM cluster labels

Parameters:
  • pixel_data_path (str) – The path to the pixel data directory

  • pixel_pysom_obj (ark.phenotyping.cluster_helpers.PixelSOMCluster) – The pixel SOM cluster object

  • overwrite (bool) – Whether to overwrite the pixel SOM clusters or not

  • num_parallel_pixels (int) – How many pixels to label in parallel at once for each FOV

  • fov (str) – The name of the FOV to process

Returns:

The name of the FOV as well as the return code

Return type:

tuple (str, int)
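
Assigning SOM labels means mapping each pixel to its closest SOM node, with num_parallel_pixels limiting how many pixels are handled at once. A minimal NumPy sketch, assuming Euclidean distance and 1-based cluster labels; the function assign_som_labels and the toy arrays are hypothetical and do not reproduce ark's code.

```python
import numpy as np

def assign_som_labels(pixels, weights, num_parallel_pixels):
    """Label each pixel with the 1-based index of its nearest SOM node."""
    labels = np.empty(len(pixels), dtype=int)
    for start in range(0, len(pixels), num_parallel_pixels):
        chunk = pixels[start:start + num_parallel_pixels]
        # Squared Euclidean distance from every pixel in the chunk to every node.
        dists = ((chunk[:, None, :] - weights[None, :, :]) ** 2).sum(axis=2)
        labels[start:start + len(chunk)] = dists.argmin(axis=1) + 1
    return labels

weights = np.array([[0.0, 0.0], [1.0, 1.0]])  # 2 nodes x 2 channels
pixels = np.array([[0.1, 0.0], [0.9, 1.0], [0.2, 0.1]])
labels = assign_som_labels(pixels, weights, num_parallel_pixels=2)
```

Chunking bounds the size of the intermediate distance array, which is the reason for processing a limited number of pixels at a time.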

ark.phenotyping.pixel_som_clustering.train_pixel_som(fovs, channels, base_dir, subset_dir='pixel_mat_subsetted', norm_vals_name='post_rowsum_chan_norm.feather', som_weights_name='pixel_som_weights.feather', xdim=10, ydim=10, lr_start=0.05, lr_end=0.01, num_passes=1, seed=42, overwrite=False)[source]

Run the SOM training on the subsetted pixel data.

Saves SOM weights to base_dir/som_weights_name.

Parameters:
  • fovs (list) – The list of fovs to subset on

  • channels (list) – The list of markers to subset on

  • base_dir (str) – The path to the data directories

  • subset_dir (str) – The name of the subsetted data directory

  • norm_vals_name (str) – The name of the file to store the 99.9% normalization values

  • som_weights_name (str) – The name of the file to save the SOM weights to

  • xdim (int) – The number of x nodes to use for the SOM

  • ydim (int) – The number of y nodes to use for the SOM

  • lr_start (float) – The start learning rate for the SOM, decays to lr_end

  • lr_end (float) – The end learning rate for the SOM, decays from lr_start

  • num_passes (int) – The number of training passes to make through the dataset

  • seed (int) – The random seed to use for training the SOM

  • overwrite (bool) – If set, force retrains the SOM and overwrites the weights

Returns:

The SOM cluster object containing the pixel SOM weights

Return type:

cluster_helpers.PixelSOMCluster

ark.phenotyping.pixie_preprocessing

ark.phenotyping.pixie_preprocessing.create_fov_pixel_data(fov, channels, img_data, seg_labels, pixel_thresh_val, blur_factor=2, subset_proportion=0.1)[source]

Preprocess pixel data for one fov

Parameters:
  • fov (str) – Name of the fov to index

  • channels (list) – List of channels to subset over

  • img_data (numpy.ndarray) – Array representing image data for one fov

  • seg_labels (numpy.ndarray) – Array representing segmentation labels for one fov

  • pixel_thresh_val (float) – value used to determine per-pixel cutoff for total signal inclusion

  • blur_factor (int) – The sigma to set for the Gaussian blur

  • subset_proportion (float) – The proportion of pixels to take from each fov

Returns:

Contains the following:

Return type:

tuple
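
The per-FOV preprocessing can be pictured as filtering low-signal pixels, normalizing each pixel by its channel sum, and drawing a random subset. A minimal NumPy sketch under stated assumptions: the Gaussian blur step is omitted, pixels are kept when their total signal exceeds pixel_thresh_val, and the helper normalize_and_subset is hypothetical. The exact order and rules used by ark may differ.

```python
import numpy as np

def normalize_and_subset(pixel_mat, pixel_thresh_val, subset_proportion, seed=42):
    """Filter low-signal pixels, row-normalize, and draw a random subset."""
    row_sums = pixel_mat.sum(axis=1)
    # Assumption: keep pixels whose summed signal is above the threshold.
    keep = row_sums > pixel_thresh_val
    normalized = pixel_mat[keep] / row_sums[keep, None]
    # Sample a fixed proportion of the remaining pixels without replacement.
    rng = np.random.default_rng(seed)
    n_subset = int(round(len(normalized) * subset_proportion))
    subset_idx = rng.choice(len(normalized), size=n_subset, replace=False)
    return normalized, normalized[subset_idx]

# Toy data: 5 pixels x 2 channels; the second pixel falls below the threshold.
pixel_mat = np.array([[1.0, 3.0], [0.0, 0.01], [2.0, 2.0], [5.0, 5.0], [3.0, 1.0]])
full, subset = normalize_and_subset(
    pixel_mat, pixel_thresh_val=0.05, subset_proportion=0.5
)
```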

ark.phenotyping.pixie_preprocessing.create_pixel_matrix(fovs, channels, base_dir, tiff_dir, seg_dir, img_sub_folder='TIFs', seg_suffix='_whole_cell.tiff', pixel_output_dir='pixel_output_dir', data_dir='pixel_mat_data', subset_dir='pixel_mat_subsetted', norm_vals_name_pre_rownorm='channel_norm_pre_rownorm.feather', norm_vals_name_post_rownorm='channel_norm_post_rownorm.feather', pixel_thresh_name='pixel_thresh.feather', channel_percentile_pre_rownorm=0.99, channel_percentile_post_rownorm=0.999, is_mibitiff=False, blur_factor=2, subset_proportion=0.1, seed=42, multiprocess=False, batch_size=5)[source]

For each fov, add a Gaussian blur to each channel and normalize channel sums for each pixel

Saves data to data_dir and subsetted data to subset_dir

Parameters:
  • fovs (list) – List of fovs to subset over

  • channels (list) – List of channels to subset over, applies only to pixel_mat_subset

  • base_dir (str) – The path to the data directories

  • tiff_dir (str) – Name of the directory containing the tiff files

  • seg_dir (str) – Name of the directory containing the segmented files. Set to None if no segmentation directory is available or desired.

  • img_sub_folder (str) – Name of the subdirectory inside tiff_dir containing the tiff files. Set to None if there isn’t any.

  • seg_suffix (str) – The suffix that the segmentation images use. Ignored if seg_dir is None.

  • pixel_output_dir (str) – The name of the data directory containing the pixel data to use for the clustering pipeline. data_dir and subset_dir should be placed here.

  • data_dir (str) – Name of the directory which contains the full preprocessed pixel data. Should be placed in pixel_output_dir.

  • subset_dir (str) – The name of the directory containing the subsetted pixel data. Should be placed in pixel_output_dir.

  • norm_vals_name_pre_rownorm (str) – The name of the file to store the pre-pixel-normalized norm values

  • norm_vals_name_post_rownorm (str) – The name of the file to store the post-pixel-normalized norm values

  • pixel_thresh_name (str) – The name of the file to store the pixel threshold value

  • channel_percentile_pre_rownorm (float) – Percentile used to normalize channels before pixel normalization

  • channel_percentile_post_rownorm (float) – Percentile used to normalize channels after pixel normalization

  • is_mibitiff (bool) – Whether to load the images from MIBITiff

  • blur_factor (int) – The sigma to set for the Gaussian blur

  • subset_proportion (float) – The proportion of pixels to take from each fov

  • seed (int) – The random seed to set for subsetting

  • multiprocess (bool) – Whether to use multiprocessing or not

  • batch_size (int) – The number of FOVs to process in parallel, ignored if multiprocess is False
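
The channel_percentile_* parameters are given as fractions (0.99, 0.999), which matches the convention of numpy.quantile. A minimal sketch of percentile-based channel normalization, assuming each channel is divided by its value at that quantile; the helper channel_norm and the random data are hypothetical, and how ark aggregates the values across FOVs is not shown.

```python
import numpy as np

def channel_norm(pixel_mat, percentile):
    """Divide each channel by its value at the given quantile (e.g. 0.999)."""
    norm_vals = np.quantile(pixel_mat, percentile, axis=0)
    return pixel_mat / norm_vals, norm_vals

# Toy data: 1000 pixels x 3 channels on very different intensity scales.
rng = np.random.default_rng(0)
pixel_mat = rng.random((1000, 3)) * np.array([1.0, 10.0, 100.0])
normed, norm_vals = channel_norm(pixel_mat, percentile=0.999)
```

After normalization, the chosen quantile of every channel sits at 1, so channels with different dynamic ranges become comparable.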

ark.phenotyping.pixie_preprocessing.preprocess_fov(base_dir, tiff_dir, data_dir, subset_dir, seg_dir, seg_suffix, img_sub_folder, is_mibitiff, channels, blur_factor, subset_proportion, pixel_thresh_val, seed, channel_norm_df, fov)[source]

Helper function to read in the FOV-level pixel data, run create_fov_pixel_data, and save the preprocessed data.

Parameters:
  • base_dir (str) – The path to the data directories

  • tiff_dir (str) – Name of the directory containing the tiff files

  • data_dir (str) – Name of the directory which contains the full preprocessed pixel data

  • subset_dir (str) – The name of the directory containing the subsetted pixel data

  • seg_dir (str) – Name of the directory containing the segmented files. Set to None if no segmentation directory is available or desired.

  • seg_suffix (str) – The suffix that the segmentation images use. Ignored if seg_dir is None.

  • img_sub_folder (str) – Name of the subdirectory inside tiff_dir containing the tiff files. Set to None if there isn’t any.

  • is_mibitiff (bool) – Whether to load the images from MIBITiff

  • channels (list) – List of channels to subset over, applies only to pixel_mat_subset

  • blur_factor (int) – The sigma to set for the Gaussian blur

  • subset_proportion (float) – The proportion of pixels to take from each fov

  • pixel_thresh_val (float) – The value to normalize the pixels by

  • seed (int) – The random seed to set for subsetting

  • channel_norm_df (pandas.DataFrame) – The channel normalization values to use

  • fov (str) – The name of the FOV to preprocess

Returns:

The full preprocessed pixel dataset, needed for computing 99.9% normalized values in create_pixel_matrix

Return type:

pandas.DataFrame

ark.phenotyping.post_cluster_utils

ark.phenotyping.post_cluster_utils.create_mantis_project(cell_table: pandas.DataFrame, fovs: List[str], seg_dir: Union[Path, str], mask_dir: Union[Path, str], image_dir: Union[Path, str], mantis_dir: Union[Path, str], pop_col: str = 'cell_meta_cluster', fov_col: str = 'fov', label_col: str = 'label', seg_suffix_name: str = '_whole_cell.tiff') → None[source]

Creates a complete Mantis Project for viewing cell labels.

Parameters:
  • cell_table (pd.DataFrame) – DataFrame of extracted cell features and subtypes.

  • fovs (List[str]) – A list of FOVs to use for creating the project.

  • seg_dir (Union[pathlib.Path, str]) – The path to the directory containing the segmentation images.

  • mask_dir (Union[pathlib.Path, str]) – The path to the directory where the masks will be stored.

  • image_dir (Union[pathlib.Path, str]) – The path to the directory containing the raw image data.

  • mantis_dir (Union[pathlib.Path, str]) – The path to the directory where the mantis project will be created.

  • pop_col (str, optional) – The column name containing the distinct cell populations. Defaults to settings.CELL_TYPE ("cell_meta_cluster").

  • fov_col (str, optional) – The column name containing the FOV IDs. Defaults to settings.FOV_ID ("fov").

  • label_col (str, optional) – The column name containing the cell label. Defaults to settings.CELL_LABEL ("label").

  • seg_suffix_name (str, optional) – The suffix of the segmentation file and its file extension. Defaults to "_whole_cell.tiff".

ark.phenotyping.post_cluster_utils.generate_new_cluster_resolution(cell_table, cluster_col, new_cluster_col, cluster_mapping, save_path)[source]

Add a new column of broader cell cluster assignments to the cell table.

Parameters:
  • cell_table (pd.DataFrame) – cell table with clustered cell populations

  • cluster_col (str) – column containing the cell phenotype

  • new_cluster_col (str) – new column to create

  • cluster_mapping (dict) – dictionary with keys detailing the new cluster names and values explaining which cell types to group together

  • save_path (str) – where to save the new cell table
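
Because cluster_mapping is keyed by the new cluster name, it has to be inverted before it can be applied to the existing column. A minimal pandas sketch of that step; the cell type names and the new column name cell_cluster_broad are made up for illustration, and saving to save_path is omitted.

```python
import pandas as pd

# Toy cell table with an existing cluster column.
cell_table = pd.DataFrame(
    {"cell_meta_cluster": ["CD4T", "CD8T", "Macrophage", "CD4T"]}
)

# Keys are the new, broader names; values list the existing cell types to merge.
cluster_mapping = {"T cell": ["CD4T", "CD8T"], "Myeloid": ["Macrophage"]}

# Invert to old name -> new name so it can be applied with Series.map.
old_to_new = {old: new for new, olds in cluster_mapping.items() for old in olds}

# Fail early if a cell type in the table has no broader group assigned.
unmapped = set(cell_table["cell_meta_cluster"]) - set(old_to_new)
if unmapped:
    raise ValueError(f"Cell types missing from cluster_mapping: {sorted(unmapped)}")

cell_table["cell_cluster_broad"] = cell_table["cell_meta_cluster"].map(old_to_new)
```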

ark.phenotyping.post_cluster_utils.plot_hist_thresholds(cell_table, populations, marker, pop_col='cell_meta_cluster', threshold=None, percentile=0.999)[source]

Create histograms to compare marker distributions across cell populations

Parameters:
  • cell_table (pd.DataFrame) – cell table with clustered cell populations

  • populations (list) – populations to plot as stacked histograms

  • marker (str) – the marker used to generate the histograms

  • pop_col (str) – the column containing the names of the cell populations

  • threshold (float, None) – optional value to plot a horizontal line for visualization

  • percentile (float) – cap used to control x axis limits of the plot

ark.phenotyping.weighted_channel_comp

ark.phenotyping.weighted_channel_comp.compute_cell_cluster_weighted_channel_avg(fovs, channels, base_dir, weighted_cell_channel_name, cell_cluster_data, cell_cluster_col='cell_meta_cluster')[source]

Computes the average weighted marker expression for each cell cluster

Parameters:
  • fovs (list) – The list of fovs to subset on

  • channels (list) – The list of channels to subset on

  • base_dir (str) – The path to the data directory

  • weighted_cell_channel_name (str) – The name of the weighted cell table, created in 3_Pixie_Cluster_Cells.ipynb

  • cell_cluster_data (pandas.DataFrame) – The cell data with cluster labels

  • cell_cluster_col (str) – Whether to aggregate by cell SOM or meta labels. Needs to be either ‘cell_som_cluster’ or ‘cell_meta_cluster’

Returns:

Each cell cluster mapped to the average expression for each marker

Return type:

pandas.DataFrame

ark.phenotyping.weighted_channel_comp.compute_p2c_weighted_channel_avg(pixel_channel_avg, channels, cell_counts, fovs=None, pixel_cluster_col='pixel_meta_cluster_rename')[source]

Compute the average marker expression for each cell weighted by pixel cluster

This expression is weighted by the pixel SOM/meta cluster counts. So for each cell, the marker expression vector is computed by:

pixel_cluster_n_count * avg_marker_exp_pixel_cluster_n + ...

These values are then normalized by the cell’s respective size.

Note that this function will only be used to correct overlapping signal for visualization.

Parameters:
  • pixel_channel_avg (pandas.DataFrame) – The average channel values for each pixel SOM/meta cluster. Computed by compute_pixel_cluster_channel_avg

  • channels (list) – The list of channels to subset pixel_channel_avg by

  • cell_counts (pandas.DataFrame) – The dataframe listing the number of each type of pixel SOM/meta cluster per cell

  • fovs (list) – The list of fovs to include, if None provided all are used

  • pixel_cluster_col (str) – Name of the pixel cluster column to group by. Should be 'pixel_som_cluster' or 'pixel_meta_cluster_rename'

Returns:

Returns the average marker expression for each cell in the dataset

Return type:

pandas.DataFrame
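
The weighted sum described above is a matrix product of the per-cell pixel cluster counts with the per-cluster channel averages, followed by division by cell size. A minimal NumPy sketch with toy numbers; the array names mirror the parameters, while cell_size as a separate array measured in pixels is an assumption.

```python
import numpy as np

# Rows: cells; columns: pixel clusters. Counts of each pixel cluster per cell.
cell_counts = np.array([[3, 1], [0, 4]])
# Rows: pixel clusters; columns: channels. Average expression per pixel cluster.
pixel_channel_avg = np.array([[1.0, 0.0], [0.0, 2.0]])
# Assumption: cell size in pixels, used to normalize the weighted sums.
cell_size = np.array([4, 4])

# pixel_cluster_n_count * avg_marker_exp_pixel_cluster_n, summed over clusters.
weighted = cell_counts @ pixel_channel_avg
# Normalize each cell's vector by that cell's size.
weighted_avg = weighted / cell_size[:, None]
```

For the first cell this gives (3 * 1.0 + 1 * 0.0) / 4 = 0.75 for the first channel and (3 * 0.0 + 1 * 2.0) / 4 = 0.5 for the second.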

ark.phenotyping.weighted_channel_comp.generate_remap_avg_wc_files(fovs, channels, base_dir, cell_som_input_data, cell_remapped_name, weighted_cell_channel_name, cell_som_cluster_channel_avg_name, cell_meta_cluster_channel_avg_name)[source]

Apply the cell cluster remapping to the average weighted channel files

Parameters:
  • fovs (list) – The list of fovs to subset on

  • channels (list) – The list of channels to subset on

  • base_dir (str) – The path to the data directory

  • cell_som_input_data (pandas.DataFrame) – The input data used for SOM training. For weighted channel averaging, this should contain the number of pixel SOM/meta cluster counts of each cell, normalized by cell_size.

  • cell_remapped_name (str) – Name of the file containing the cell SOM clusters to their remapped meta clusters

  • weighted_cell_channel_name (str) – The name of the file containing the weighted channel expression table

  • cell_som_cluster_channel_avg_name (str) – The name of the file to save the average weighted channel expression per cell SOM cluster

  • cell_meta_cluster_channel_avg_name (str) – Same as above except for cell meta clusters

ark.phenotyping.weighted_channel_comp.generate_wc_avg_files(fovs, channels, base_dir, cell_cc, cell_som_input_data, weighted_cell_channel_name='weighted_cell_channel.feather', cell_som_cluster_channel_avg_name='cell_som_cluster_channel_avg.csv', cell_meta_cluster_channel_avg_name='cell_meta_cluster_channel_avg.csv', overwrite=False)[source]

Generate the weighted channel average files per cell SOM and meta clusters.

When running cell clustering with pixel clusters generated from Pixie, the counts of each pixel cluster per cell are computed. These are multiplied by the average expression profile of each pixel cluster to determine the weighted channel average. This computation is averaged by both cell SOM and meta cluster.

Parameters:
  • fovs (list) – The list of fovs to subset on

  • channels (list) – The list of channels to subset on

  • base_dir (str) – The path to the data directory

  • cell_cc (cluster_helpers.PixieConsensusCluster) – The consensus cluster object containing the SOM to meta mapping

  • cell_som_input_data (pandas.DataFrame) – The input data used for SOM training. For weighted channel averaging, it should contain the number of pixel SOM/meta cluster counts of each cell, normalized by cell_size.

  • weighted_cell_channel_name (str) – The name of the file containing the weighted channel expression table

  • cell_som_cluster_channel_avg_name (str) – The name of the file to save the average weighted channel expression per cell SOM cluster

  • cell_meta_cluster_channel_avg_name (str) – Same as above except for cell meta clusters

  • overwrite (bool) – If set, regenerate average weighted channel expression for SOM and meta clusters

ark.phenotyping.weighted_channel_comp.generate_weighted_channel_avg_heatmap(cell_cluster_channel_avg_path, cell_cluster_col, channels, raw_cmap, renamed_cmap, center_val=0, min_val=-3, max_val=3)[source]

Generates a z-scored heatmap of the average weighted channel expression per cell cluster

Parameters:
  • cell_cluster_channel_avg_path (str) – Path to the file containing the average weighted channel expression per cell cluster

  • cell_cluster_col (str) – The name of the cell cluster col, needs to be either ‘cell_som_cluster’ or ‘cell_meta_cluster_rename’

  • channels (list) – The list of channels to visualize

  • raw_cmap (dict) – Maps the raw meta cluster labels to their respective colors, created by generate_meta_cluster_colormap_dict

  • renamed_cmap (dict) – Maps the renamed meta cluster labels to their respective colors, created by generate_meta_cluster_colormap_dict

  • center_val (float) – value at which to center the heatmap

  • min_val (float) – minimum value the heatmap should take

  • max_val (float) – maximum value the heatmap should take