ark.phenotyping¶
ark.phenotyping.cell_cluster_utils¶
- ark.phenotyping.cell_cluster_utils.add_consensus_labels_cell_table(base_dir, cell_table_path, cell_som_input_data)[source]¶
Adds the consensus cluster labels to the cell table, then resaves data to
{cell_table_path}_cell_labels.csv
- Parameters:
base_dir (str) – The path to the data directory
cell_table_path (str) – Path of the cell table, needs to be created with
Segment_Image_Data.ipynb
cell_som_input_data (pandas.DataFrame) – The input data used for SOM training
- ark.phenotyping.cell_cluster_utils.compute_cell_som_cluster_cols_avg(cell_cluster_data, cell_som_cluster_cols, cell_cluster_col, keep_count=False)[source]¶
For each cell SOM cluster, compute the average expression of all
cell_som_cluster_cols
- Parameters:
cell_cluster_data (pandas.DataFrame) – The cell data with SOM and/or meta labels, created by
cluster_cells
or cell_consensus_cluster
cell_som_cluster_cols (list) – The list of columns used for SOM training
cell_cluster_col (str) – Name of the cell cluster column to group by, should be
'cell_som_cluster'
or 'cell_meta_cluster'
keep_count (bool) – Whether to include the cell counts or not, should only be set to
True
for visualization support
- Returns:
Contains the average values for each column across cell SOM clusters
- Return type:
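The per-cluster averaging described for compute_cell_som_cluster_cols_avg can be sketched in plain Python as below. This is an illustration of the idea only, not the library's implementation: the helper name cluster_column_averages, the marker names, and the "count" column name are assumptions, and the real function operates on a pandas.DataFrame rather than a list of dicts.

```python
from collections import defaultdict


def cluster_column_averages(rows, value_cols, cluster_col, keep_count=False):
    """Average every column in value_cols within each cluster_col group."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[cluster_col]].append(row)

    averages = []
    for cluster_id in sorted(groups):
        members = groups[cluster_id]
        avg_row = {cluster_col: cluster_id}
        for col in value_cols:
            avg_row[col] = sum(m[col] for m in members) / len(members)
        if keep_count:
            # Hypothetical column name, used here only for illustration
            avg_row["count"] = len(members)
        averages.append(avg_row)
    return averages


cells = [
    {"cell_som_cluster": 1, "CD4": 0.2, "CD8": 0.0},
    {"cell_som_cluster": 1, "CD4": 0.4, "CD8": 0.2},
    {"cell_som_cluster": 2, "CD4": 0.0, "CD8": 1.0},
]
avg = cluster_column_averages(
    cells, ["CD4", "CD8"], "cell_som_cluster", keep_count=True
)
```

Grouping by 'cell_meta_cluster' instead of 'cell_som_cluster' only changes the cluster_col argument.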
- ark.phenotyping.cell_cluster_utils.create_c2pc_data(fovs, pixel_data_path, cell_table_path, pixel_cluster_col='pixel_meta_cluster_rename')[source]¶
Create a matrix with each fov-cell label pair and their SOM pixel/meta cluster counts
- Parameters:
fovs (list) – The list of fovs to subset on
pixel_data_path (str) – Path to directory with the pixel data with SOM and meta labels attached. Created by
pixel_consensus_cluster
.
cell_table_path (str) – Path to the cell table, needs to be created with
Segment_Image_Data.ipynb
pixel_cluster_col (str) – The name of the pixel cluster column to count per cell. Should be
'pixel_som_cluster'
or 'pixel_meta_cluster_rename'
- Returns:
pandas.DataFrame: cell x cluster counts of each pixel SOM/meta cluster per each cell
pandas.DataFrame: same as above, but normalized by cell_size
- Return type:
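As a rough illustration of the two outputs of create_c2pc_data, the sketch below counts pixel clusters per (fov, label) pair and then divides by cell size. It is a plain-Python approximation under stated assumptions: the helper name, the dict-based inputs, and the example cluster names are hypothetical, and the library itself reads the pixel data and cell table from disk.

```python
from collections import Counter


def count_pixel_clusters_per_cell(pixels, cell_sizes, cluster_col):
    """Return raw and size-normalized pixel cluster counts per (fov, label)."""
    counts = {}
    for px in pixels:
        key = (px["fov"], px["label"])
        counts.setdefault(key, Counter())[px[cluster_col]] += 1

    # Divide each count by the size of the cell it belongs to
    normalized = {
        key: {cluster: n / cell_sizes[key] for cluster, n in per_cell.items()}
        for key, per_cell in counts.items()
    }
    return counts, normalized


pixels = [
    {"fov": "fov1", "label": 1, "pixel_meta_cluster_rename": "tumor"},
    {"fov": "fov1", "label": 1, "pixel_meta_cluster_rename": "tumor"},
    {"fov": "fov1", "label": 1, "pixel_meta_cluster_rename": "immune"},
    {"fov": "fov1", "label": 2, "pixel_meta_cluster_rename": "immune"},
]
cell_sizes = {("fov1", 1): 4, ("fov1", 2): 2}
counts, normalized = count_pixel_clusters_per_cell(
    pixels, cell_sizes, "pixel_meta_cluster_rename"
)
```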
ark.phenotyping.cell_meta_clustering¶
- ark.phenotyping.cell_meta_clustering.apply_cell_meta_cluster_remapping(base_dir, cell_som_input_data, cell_remapped_name)[source]¶
Apply the meta cluster remapping to the data in
cell_consensus_name
. Resave the re-mapped consensus data to cell_consensus_name
.
- Parameters:
base_dir (str) – The path to the data directory
cell_som_input_data (pandas.DataFrame) – The input data used for SOM training
cell_remapped_name (str) – Name of the file containing the cell SOM clusters to their remapped meta clusters
- Returns:
The input data used for SOM training with renamed meta labels attached
- Return type:
- ark.phenotyping.cell_meta_clustering.cell_consensus_cluster(base_dir, cell_som_cluster_cols, cell_som_input_data, cell_som_expr_col_avg_name, max_k=20, cap=3, seed=42, overwrite=False)[source]¶
Run consensus clustering algorithm on cell-level data averaged across each cell SOM cluster.
Saves data with consensus cluster labels to cell_consensus_name.
- Parameters:
base_dir (str) – The path to the data directory
cell_som_cluster_cols (list) – The list of columns used for SOM training
cell_som_input_data (pandas.DataFrame) – The data used for SOM training with SOM labels attached
cell_som_expr_col_avg_name (str) – The name of the file with the average expression per column across cell SOM clusters. Used to run consensus clustering on.
max_k (int) – The number of consensus clusters
cap (int) – z-score cap to use when running hierarchical clustering
seed (int) – The random seed to set for consensus clustering
overwrite (bool) – If set, overwrites the meta cluster assignments if they exist
- Returns:
cluster_helpers.PixieConsensusCluster: the consensus cluster object containing the SOM to meta mapping
pandas.DataFrame: the input data used for SOM training with meta labels attached
- Return type:
- ark.phenotyping.cell_meta_clustering.generate_meta_avg_files(base_dir, cell_cc, cell_som_cluster_cols, cell_som_input_data, cell_som_expr_col_avg_name, cell_meta_expr_col_avg_name, overwrite=False)[source]¶
Computes and saves the average cluster column expression across cell meta clusters. Assigns meta cluster labels to the data stored in
cell_som_expr_col_avg_name
.
- Parameters:
base_dir (str) – The path to the data directory
cell_cc (cluster_helpers.PixieConsensusCluster) – The consensus cluster object containing the SOM to meta mapping
cell_som_cluster_cols (list) – The list of columns used for SOM training
cell_som_input_data (pandas.DataFrame) – The input data used for SOM training. Will have meta labels appended after this process is run.
cell_som_expr_col_avg_name (str) – The average values of
cell_som_cluster_cols
per cell SOM cluster. Used to run consensus clustering on.
cell_meta_expr_col_avg_name (str) – Same as above except for cell meta clusters
overwrite (bool) – If set, regenerate the averages of
cell_som_cluster_cols
per meta cluster
- ark.phenotyping.cell_meta_clustering.generate_remap_avg_count_files(base_dir, cell_som_input_data, cell_remapped_name, cell_som_cluster_cols, cell_som_expr_col_avg_name, cell_meta_expr_col_avg_name)[source]¶
Apply the cell cluster remapping to the average count files
- Parameters:
base_dir (str) – The path to the data directory
cell_som_input_data (pandas.DataFrame) – The input data used for SOM training
cell_remapped_name (str) – Name of the file containing the cell SOM clusters to their remapped meta clusters
cell_som_cluster_cols (list) – The list of columns used for SOM training
cell_som_expr_col_avg_name (str) – The average values of
cell_som_cluster_cols
per cell SOM cluster
cell_meta_expr_col_avg_name (str) – Same as above except for cell meta clusters
ark.phenotyping.cell_som_clustering¶
- ark.phenotyping.cell_som_clustering.cluster_cells(base_dir, cell_pysom, cell_som_cluster_cols, num_parallel_cells=1000000, overwrite=False)[source]¶
Uses trained SOM weights to assign cluster labels on full cell data.
Saves data with cluster labels to
cell_cluster_name
.
- Parameters:
base_dir (str) – The path to the data directory
cell_pysom (cluster_helpers.CellSOMCluster) – The SOM cluster object containing the cell SOM weights
cell_som_cluster_cols (list) – The list of columns used for SOM training
num_parallel_cells (int) – How many cells to label in parallel at once
overwrite (bool) – If set, overwrites the SOM cluster assignments if they exist
- Returns:
The cell data in
cell_pysom.cell_data
with SOM labels assigned
- Return type:
- ark.phenotyping.cell_som_clustering.generate_som_avg_files(base_dir, cell_som_input_data, cell_som_cluster_cols, cell_som_expr_col_avg_name, overwrite=False)[source]¶
Computes and saves the average expression of all
cell_som_cluster_cols
across cell SOM clusters.
- Parameters:
base_dir (str) – The path to the data directory
cell_som_input_data (pandas.DataFrame) – The input data used for SOM training with SOM labels attached
cell_som_cluster_cols (list) – The list of columns used for SOM training
cell_som_expr_col_avg_name (str) – The name of the file to write the average expression per column across cell SOM clusters
overwrite (bool) – If set, regenerate the averages of
cell_som_cluster_cols
for SOM clusters
- ark.phenotyping.cell_som_clustering.train_cell_som(fovs, base_dir, cell_table_path, cell_som_cluster_cols, cell_som_input_data, som_weights_name='cell_som_weights.feather', xdim=10, ydim=10, lr_start=0.05, lr_end=0.01, num_passes=1, seed=42, overwrite=False, normalize=True)[source]¶
Run the SOM training on the expression columns specified in
cell_som_cluster_cols
. Saves the SOM weights to
base_dir/som_weights_name
.
- Parameters:
fovs (list) – The list of fovs to subset on
base_dir (str) – The path to the data directories
cell_table_path (str) – Path of the cell table, needs to be created with
Segment_Image_Data.ipynb
cell_som_cluster_cols (list) – The list of columns in
cell_som_input_data
to use for SOM training
cell_som_input_data (pandas.DataFrame) – The input data to use for SOM training
som_weights_name (str) – The name of the file to save the SOM weights to
xdim (int) – The number of x nodes to use for the SOM
ydim (int) – The number of y nodes to use for the SOM
lr_start (float) – The start learning rate for the SOM, decays to
lr_end
lr_end (float) – The end learning rate for the SOM, decays from
lr_start
num_passes (int) – The number of training passes to make through the dataset
seed (int) – The random seed to use for training the SOM
overwrite (bool) – If set, force retrains the SOM and overwrites the weights
normalize (bool) – Whether to perform 99.9th percentile normalization. Defaults to True.
- Returns:
The SOM cluster object containing the cell SOM weights
- Return type:
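The normalize flag refers to 99.9th percentile normalization. One plausible reading, shown below as a hedged sketch, is that each column is divided by its own 99.9th percentile so that extreme outliers do not dominate SOM training. The helper names and the linear-interpolation percentile are assumptions made for illustration; the exact procedure used by the library is not stated in this reference.

```python
def percentile(values, q):
    """Linear-interpolation percentile for q in [0, 1]."""
    ordered = sorted(values)
    position = (len(ordered) - 1) * q
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = position - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * fraction


def normalize_by_percentile(column, q=0.999):
    """Divide a column by its q-th percentile (assumed normalization scheme)."""
    scale = percentile(column, q)
    if scale == 0:
        # Leave an all-zero column untouched rather than dividing by zero
        return list(column)
    return [value / scale for value in column]


expression = [float(v) for v in range(1001)]
scale = percentile(expression, 0.999)
normalized = normalize_by_percentile(expression)
```

Values above the chosen percentile end up slightly greater than 1 after scaling, which is expected with this scheme.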
ark.phenotyping.cluster_helpers¶
- class ark.phenotyping.cluster_helpers.CellSOMCluster(cell_data: pandas.DataFrame, weights_path: Path, fovs: List[str], columns: List[str], num_passes: int = 1, xdim: int = 10, ydim: int = 10, lr_start: float = 0.05, lr_end: float = 0.01, seed=42, normalize=True)[source]¶
Bases:
PixieSOMCluster
- __init__(cell_data: pandas.DataFrame, weights_path: Path, fovs: List[str], columns: List[str], num_passes: int = 1, xdim: int = 10, ydim: int = 10, lr_start: float = 0.05, lr_end: float = 0.01, seed=42, normalize=True)[source]¶
Creates a cell SOM cluster object derived from the abstract PixieSOMCluster
- Parameters:
cell_data (pandas.DataFrame) – The dataset to use for training
weights_path (pathlib.Path) – The path to save the weights to.
fovs (List[str]) – The list of FOVs to subset the data on.
columns (List[str]) – The list of columns to subset the data on.
num_passes (int) – The number of SOM training passes to use.
xdim (int) – The number of SOM nodes on the x-axis.
ydim (int) – The number of SOM nodes on the y-axis.
lr_start (float) – The initial learning rate.
lr_end (float) – The learning rate to decay to.
seed (int) – The random seed to use.
normalize (bool) – Whether to perform 99.9th percentile normalization. Defaults to True.
- assign_som_clusters(num_parallel_cells=1000000) pandas.DataFrame [source]¶
Assigns SOM clusters using
weights
to cell_data
- Parameters:
external_data (pandas.DataFrame) – The dataset to assign SOM clusters to
num_parallel_cells (int) – Partition size of
self.cell_data
for assigning SOM labels
- Returns:
cell_data
with the SOM clusters assigned.
- Return type:
pandas.DataFrame
- class ark.phenotyping.cluster_helpers.ClusterClassTemplate(*args, **kwargs)[source]¶
Bases:
Protocol
- class ark.phenotyping.cluster_helpers.ConsensusCluster(cluster: ClusterClassTemplate, L: int, K: int, H: int, resample_proportion: float = 0.5)[source]¶
Bases:
object
- __init__(cluster: ClusterClassTemplate, L: int, K: int, H: int, resample_proportion: float = 0.5)[source]¶
Implementation of Consensus clustering, following the paper https://link.springer.com/content/pdf/10.1023%2FA%3A1023949509487.pdf
- Parameters:
cluster (Callable) –
Clustering class.
NOTE: the class is to be instantiated with parameter
n_clusters
, and possess a fit_predict
method, which is invoked on data.
L (int) – Smallest number of clusters to try.
K (int) – Biggest number of clusters to try.
H (int) – Number of resamplings for each cluster number.
resample_proportion (float) – Percentage to sample.
Mk (numpy.ndarray) – Consensus matrices for each k (shape = (K, data.shape[0], data.shape[0])). NOTE: every consensus matrix is retained, as specified in the paper.
Ak (numpy.ndarray) – Area under CDF for each number of clusters. See paper: section 3.3.1. Consensus distribution.
deltaK (numpy.ndarray) – Changes in areas under CDF. See paper: section 3.3.1. Consensus distribution.
self.bestK (int) – Number of clusters that was found to be best.
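For readers unfamiliar with the consensus matrices stored in Mk, the sketch below builds one for a single value of k. It follows the usual definition from the consensus clustering literature: entry (i, j) is the fraction of resamplings containing both items in which they landed in the same cluster. The function name and the dict-based input are hypothetical, and this is not the class's actual code.

```python
def consensus_matrix(n_items, resamples):
    """Build a consensus matrix from per-resample cluster assignments.

    resamples is a list of dicts mapping a sampled item index to its
    cluster label in that resampling run.
    """
    together = [[0] * n_items for _ in range(n_items)]
    sampled = [[0] * n_items for _ in range(n_items)]
    for labels in resamples:
        items = list(labels)
        for i in items:
            for j in items:
                sampled[i][j] += 1
                if labels[i] == labels[j]:
                    together[i][j] += 1
    # Pairs that were never sampled together get a consensus of 0.0
    return [
        [
            together[i][j] / sampled[i][j] if sampled[i][j] else 0.0
            for j in range(n_items)
        ]
        for i in range(n_items)
    ]


runs = [
    {0: 0, 1: 0},        # items 0 and 1 sampled, same cluster
    {0: 0, 1: 1, 2: 1},  # all sampled, item 0 separated from 1 and 2
    {1: 0, 2: 0},        # items 1 and 2 sampled, same cluster
]
matrix = consensus_matrix(3, runs)
```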
- fit(data: pandas.DataFrame, verbose: bool = False)[source]¶
Fits a consensus matrix for each number of clusters
- Parameters:
data (pd.DataFrame) – The data in
(examples,attributes)
format.
verbose (bool) – Whether to print output or not.
- predict()[source]¶
Predicts on the consensus matrix, for best found cluster number.
- Returns:
The consensus matrix prediction for
self.bestK
.
- Return type:
- predict_data(data: pandas.DataFrame)[source]¶
Predicts on the data, for best found cluster number
- Parameters:
data (pandas.DataFrame) –
(examples,attributes)
format
- Returns:
The data matrix prediction for
self.bestK
.
- Return type:
- class ark.phenotyping.cluster_helpers.PixelSOMCluster(pixel_subset_folder: Path, norm_vals_path: Path, weights_path: Path, fovs: List[str], columns: List[str], num_passes: int = 1, xdim: int = 10, ydim: int = 10, lr_start: float = 0.05, lr_end: float = 0.01, seed=42)[source]¶
Bases:
PixieSOMCluster
- __init__(pixel_subset_folder: Path, norm_vals_path: Path, weights_path: Path, fovs: List[str], columns: List[str], num_passes: int = 1, xdim: int = 10, ydim: int = 10, lr_start: float = 0.05, lr_end: float = 0.01, seed=42)[source]¶
Creates a pixel SOM cluster object derived from the abstract PixieSOMCluster
- Parameters:
pixel_subset_folder (pathlib.Path) – The name of the subsetted pixel data directory
norm_vals_path (pathlib.Path) – The name of the feather file containing the normalization values.
weights_path (pathlib.Path) – The path to save the weights to.
fovs (List[str]) – The list of FOVs to subset the data on.
columns (List[str]) – The list of columns to subset the data on.
num_passes (int) – The number of SOM training passes to use.
xdim (int) – The number of SOM nodes on the x-axis.
ydim (int) – The number of SOM nodes on the y-axis.
lr_start (float) – The initial learning rate.
lr_end (float) – The learning rate to decay to.
seed (int) – The random seed to use.
- assign_som_clusters(external_data: pandas.DataFrame, normalize_data: bool = True, num_parallel_pixels: int = 1000000) pandas.DataFrame [source]¶
Assigns SOM clusters using
weights
to a dataset
- Parameters:
external_data (pandas.DataFrame) – The dataset to assign SOM clusters to
normalize_data (bool) – Whether or not to normalize
external_data
. Flag needed to prevent re-normalization of an already normalized dataset.
num_parallel_pixels (int) – Partition size of
external_data
for assigning SOM labels
- Returns:
The dataset with the SOM clusters assigned.
- Return type:
pandas.DataFrame
- normalize_data(external_data: pandas.DataFrame) pandas.DataFrame [source]¶
Uses
norm_data
to normalize a dataset
- Parameters:
external_data (pandas.DataFrame) – The data to normalize
- Returns:
The data with
columns
normalized by the values in norm_data
- Return type:
pandas.DataFrame
- class ark.phenotyping.cluster_helpers.PixieConsensusCluster(cluster_type: str, input_file: Path, columns: List[str], max_k: int = 20, cap: float = 3)[source]¶
Bases:
object
- __init__(cluster_type: str, input_file: Path, columns: List[str], max_k: int = 20, cap: float = 3)[source]¶
Constructs a generic ConsensusCluster pipeline object that makes use of Sagovic’s implementation of consensus clustering in Python.
- Parameters:
cluster_type (str) – The type of data being run through consensus clustering. Must be either
'pixel'
or 'cell'
input_file (pathlib.Path) – The average expression values per SOM cluster .csv, computed by
ark.phenotyping.cluster_pixels
or ark.phenotyping.cluster_cells
depending on the type of data being generated.
columns (List[str]) – The list of columns to subset the data in
input_file
on for consensus clustering.
max_k (int) – The number of consensus clusters to use.
cap (float) – The value to cap the data in
input_file
at after z-score normalization. Data will be within the range [-cap, cap]
.
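The effect of cap can be shown with a small sketch: values are z-scored and then clipped to [-cap, cap]. The function name is hypothetical and the use of the population standard deviation is an assumption; the reference does not say which estimator the library uses.

```python
from statistics import mean, pstdev


def zscore_and_cap(values, cap=3.0):
    """Z-score a column, then clip the result to the range [-cap, cap]."""
    mu = mean(values)
    sigma = pstdev(values)  # assumption: population standard deviation
    if sigma == 0:
        # A constant column carries no signal, so every z-score is 0
        return [0.0 for _ in values]
    return [max(-cap, min(cap, (v - mu) / sigma)) for v in values]


vals = [0.0] * 9 + [100.0]
capped = zscore_and_cap(vals, cap=2.0)
```

Here the single outlier has a z-score of 3 before capping and is pulled down to 2, while the remaining values are left unchanged.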
- assign_consensus_labels(external_data: pandas.DataFrame) pandas.DataFrame [source]¶
Takes an external dataset and applies
ConsensusCluster
mapping to it.
- Parameters:
external_data (pandas.DataFrame) – A dataset which contains a
'{self.cluster_type}_som_cluster'
column.
- Returns:
The
external_data
with a '{self.cluster_type}_meta_cluster'
column attached.
- Return type:
pandas.DataFrame
- generate_som_to_meta_map()[source]¶
Maps each
'{self.cluster_type}_som_cluster'
to the meta cluster generated by ConsensusCluster
. Also assigns mapping to
self.mapping
for use in assign_consensus_labels
.
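Conceptually, the SOM-to-meta mapping is a lookup from each SOM cluster ID to a meta cluster ID, which is then applied row by row. The sketch below illustrates that lookup with plain dicts. The function name and the example mapping are hypothetical, and the real methods work on pandas.DataFrame objects.

```python
def assign_meta_labels(rows, som_to_meta, cluster_type="cell"):
    """Attach a '<cluster_type>_meta_cluster' value to every row."""
    som_col = f"{cluster_type}_som_cluster"
    meta_col = f"{cluster_type}_meta_cluster"
    return [{**row, meta_col: som_to_meta[row[som_col]]} for row in rows]


# Hypothetical mapping: three SOM clusters collapse into two meta clusters
som_to_meta = {1: 1, 2: 1, 3: 2}
data = [
    {"fov": "fov1", "cell_som_cluster": 2},
    {"fov": "fov1", "cell_som_cluster": 3},
]
labeled = assign_meta_labels(data, som_to_meta, cluster_type="cell")
```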
- run_consensus_clustering()[source]¶
Fits the meta clustering results using
ConsensusCluster
.
- save_som_to_meta_map(save_path: Path)[source]¶
Saves the mapping generated by
ConsensusCluster
to save_path
.
- Parameters:
save_path (pathlib.Path) – The path to save
self.mapping
to.
- class ark.phenotyping.cluster_helpers.PixieSOMCluster(weights_path: Path, columns: List[str], num_passes: int = 1, xdim: int = 10, ydim: int = 10, lr_start: float = 0.05, lr_end: float = 0.01, seed=42)[source]¶
Bases:
ABC
- abstract __init__(weights_path: Path, columns: List[str], num_passes: int = 1, xdim: int = 10, ydim: int = 10, lr_start: float = 0.05, lr_end: float = 0.01, seed=42)[source]¶
Generic implementation of a pyFlowSOM runner
- Parameters:
weights_path (pathlib.Path) – The path to save the weights to.
columns (List[str]) – The list of columns to subset the data on.
num_passes (int) – The number of SOM training passes to use.
xdim (int) – The number of SOM nodes on the x-axis.
ydim (int) – The number of SOM nodes on the y-axis.
lr_start (float) – The initial learning rate.
lr_end (float) – The learning rate to decay to
seed (int) – The random seed to use for training.
- generate_som_clusters(external_data: pandas.DataFrame, num_parallel_obs: int = 1000000) numpy.ndarray [source]¶
Uses the weights to generate SOM clusters for a dataset
- Parameters:
external_data (pandas.DataFrame) – The dataset to generate SOM clusters for
num_parallel_obs (int) – Partition size of
external_data
for assigning SOM labels
- Returns:
The SOM clusters generated for each pixel in
external_data
- Return type:
numpy.ndarray
- abstract normalize_data() pandas.DataFrame [source]¶
Generic implementation of the normalization process to use on the input data
- Returns:
The data with
columns
normalized by the values in norm_data
- Return type:
pandas.DataFrame
- train_som(data: pandas.DataFrame)[source]¶
Trains the SOM on the data provided and saves the weights generated
- Parameters:
data (pandas.DataFrame) – The input data to train the SOM on.
- ark.phenotyping.cluster_helpers.verify_unique_meta_clusters(pixie_remapped_data: pandas.DataFrame, meta_cluster_type: Literal['pixel', 'cell'])[source]¶
Verifies that a mapping contains a unique renamed meta cluster for every base meta cluster
- Parameters:
pixie_remapped_data (pandas.DataFrame) – Must have
{pixel/cell}_meta_cluster
and {pixel/cell}_meta_cluster_rename
columns
meta_cluster_type (Literal["pixel", "cell"]) – Whether pixel or cell meta clusters are being validated
- Raises:
ValueError – If there are duplicate
{pixel/cell}_meta_cluster_rename
entries for multiple {pixel/cell}_meta_cluster
values
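The uniqueness check can be pictured as follows: every renamed meta cluster must belong to exactly one base meta cluster, otherwise a ValueError is raised. This is a minimal sketch using lists of dicts; the function name and error message are illustrative and not taken from the library.

```python
def verify_unique_renames(rows, meta_cluster_type):
    """Raise ValueError if one renamed label is shared by several meta clusters."""
    meta_col = f"{meta_cluster_type}_meta_cluster"
    rename_col = f"{meta_cluster_type}_meta_cluster_rename"
    owner = {}
    for row in rows:
        # Remember the first meta cluster seen for each renamed label
        first_owner = owner.setdefault(row[rename_col], row[meta_col])
        if first_owner != row[meta_col]:
            raise ValueError(
                f"{rename_col} '{row[rename_col]}' is used for {meta_col} "
                f"{first_owner} and {row[meta_col]}"
            )


good = [
    {"cell_meta_cluster": 1, "cell_meta_cluster_rename": "T cell"},
    {"cell_meta_cluster": 1, "cell_meta_cluster_rename": "T cell"},
    {"cell_meta_cluster": 2, "cell_meta_cluster_rename": "B cell"},
]
bad = good + [{"cell_meta_cluster": 3, "cell_meta_cluster_rename": "B cell"}]

verify_unique_renames(good, "cell")  # passes silently
try:
    verify_unique_renames(bad, "cell")
    raised = False
except ValueError:
    raised = True
```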
ark.phenotyping.pixel_cluster_utils¶
- ark.phenotyping.pixel_cluster_utils.calculate_channel_percentiles(tiff_dir, fovs, channels, img_sub_folder, percentile)[source]¶
Calculates average percentile for each channel in the dataset
- Parameters:
- Returns:
The mapping between each channel and its normalization value
- Return type:
pd.DataFrame
- ark.phenotyping.pixel_cluster_utils.calculate_pixel_intensity_percentile(tiff_dir, fovs, channels, img_sub_folder, channel_percentiles, percentile=0.05)[source]¶
Calculates average percentile per FOV for total signal in each pixel
- Parameters:
tiff_dir (str) – Name of the directory containing the tiff files
fovs (list) – List of fovs to include
channels (list) – List of channels to include
img_sub_folder (str) – Sub folder within each FOV containing image data
channel_percentiles (pd.DataFrame) – The mapping between each channel and its normalization value. Computed by
calculate_channel_percentiles
percentile (float) – The pixel intensity percentile per FOV to average over
- Returns:
The average percentile per FOV for total signal in each pixel
- Return type:
- ark.phenotyping.pixel_cluster_utils.check_for_modified_channels(tiff_dir, test_fov, img_sub_folder, channels)[source]¶
Checks to make sure the user selected newly modified channels
- ark.phenotyping.pixel_cluster_utils.compute_pixel_cluster_channel_avg(fovs, channels, base_dir, pixel_cluster_col, num_pixel_clusters, pixel_data_dir='pixel_mat_data', num_fovs_subset=100, seed=42, keep_count=False)[source]¶
Compute the average channel values across each pixel SOM cluster.
To improve performance, the number of FOVs is downsampled to
num_fovs_subset
- Parameters:
fovs (list) – The list of fovs to subset on
channels (list) – The list of channels to subset on
base_dir (str) – The path to the data directories
pixel_cluster_col (str) – Name of the column to group by
num_pixel_clusters (int) – The number of pixel clusters that are desired, if None then no fixed amount required
pixel_data_dir (str) – Name of the directory containing the pixel data with cluster labels
num_fovs_subset (float) – The number of FOVs to subset on. Note that if
len(fovs) < num_fovs_subset
, all of the FOVs will still be selected
seed (int) – The random seed to use for subsetting FOVs
keep_count (bool) – Whether to keep the count column when aggregating or not. This should only be set to
True
for visualization purposes
- Returns:
Contains the average channel values for each pixel SOM/meta cluster
- Return type:
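The interaction between num_fovs_subset and seed can be illustrated with a seeded random subset that falls back to the full list when there are too few FOVs. The helper below is a hypothetical sketch using the standard library; the sampling routine the library actually uses may differ.

```python
import random


def subset_fovs(fovs, num_fovs_subset=100, seed=42):
    """Pick a reproducible random subset of FOVs, or all of them if too few."""
    if len(fovs) <= num_fovs_subset:
        return list(fovs)
    rng = random.Random(seed)  # local generator, leaves global state untouched
    return rng.sample(list(fovs), num_fovs_subset)


all_fovs = [f"fov{i}" for i in range(250)]
first_pick = subset_fovs(all_fovs, num_fovs_subset=100, seed=42)
second_pick = subset_fovs(all_fovs, num_fovs_subset=100, seed=42)
small_pick = subset_fovs(all_fovs[:10], num_fovs_subset=100, seed=42)
```

Using the same seed yields the same subset, which keeps the averaged values reproducible between runs.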
- ark.phenotyping.pixel_cluster_utils.filter_with_nuclear_mask(fovs: List, tiff_dir: str, seg_dir: str, channel: str, nuc_seg_suffix: str = '_nuclear.tiff', img_sub_folder: str = None, exclude: bool = True)[source]¶
Filters out background staining using subcellular marker localization.
Non-nuclear signal is removed from nuclear markers and vice-versa for membrane markers.
- Parameters:
fovs (list) – The list of fovs to filter
tiff_dir (str) – Name of the directory containing the tiff files
seg_dir (str) – Name of the directory containing the segmented files
channel (str) – Channel to apply filtering to
nuc_seg_suffix (str) – The suffix for the nuclear channel. (i.e. for “fov1”, a suffix of “_nuclear.tiff” would make a file named “fov1_nuclear.tiff”)
img_sub_folder (str) – Name of the subdirectory inside
tiff_dir
containing the tiff files. Set to None
if there isn’t any.
exclude (bool) – Whether to filter out nuclear or membrane signal
- ark.phenotyping.pixel_cluster_utils.find_fovs_missing_col(base_dir, data_dir, missing_col)[source]¶
Identify FOV names in
data_dir
without missing_col
- ark.phenotyping.pixel_cluster_utils.normalize_rows(pixel_data, channels, include_seg_label=True)[source]¶
Normalizes the rows of a pixel matrix by their sum
- Parameters:
pixel_data (pandas.DataFrame) – The dataframe containing the pixel data for a given fov. Includes channel and meta (
fov
,label
, etc.) columns
channels (list) – List of channels to subset over
include_seg_label (bool) – Whether to include
'label'
as a metadata column
- Returns:
The pixel data with rows normalized and 0-sum rows removed
- Return type:
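Row normalization as described here divides each pixel's channel values by their sum and drops pixels whose channels sum to zero. A plain-Python sketch follows; the function name and channel names are hypothetical, and the library version operates on a pandas.DataFrame.

```python
def normalize_rows_by_sum(rows, channels):
    """Divide channel values by their row sum and drop zero-sum rows."""
    normalized = []
    for row in rows:
        total = sum(row[ch] for ch in channels)
        if total == 0:
            # Zero-sum rows cannot be normalized, so they are removed
            continue
        scaled = {ch: row[ch] / total for ch in channels}
        normalized.append({**row, **scaled})  # metadata columns are kept as-is
    return normalized


pixel_rows = [
    {"fov": "fov1", "label": 1, "CD4": 1.0, "CD8": 3.0},
    {"fov": "fov1", "label": 0, "CD4": 0.0, "CD8": 0.0},
]
result = normalize_rows_by_sum(pixel_rows, ["CD4", "CD8"])
```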
- ark.phenotyping.pixel_cluster_utils.smooth_channels(fovs, tiff_dir, img_sub_folder, channels, smooth_vals)[source]¶
Adds additional smoothing for selected channels as a preprocessing step
- Parameters:
fovs (list) – List of fovs to process
tiff_dir (str) – Name of the directory containing the tiff files
img_sub_folder (str) – sub-folder within each FOV containing image data
channels (list) – list of channels to apply smoothing to
smooth_vals (list or int) – amount to smooth channels. If a single int, applies to all channels. Otherwise, a custom value per channel can be supplied
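Because smooth_vals accepts either a single int or one value per channel, a caller may want to see how the two forms line up. The hypothetical helper below expands both forms into a per-channel mapping; the length check is an assumption about sensible validation, not documented library behavior.

```python
def resolve_smooth_vals(channels, smooth_vals):
    """Expand smooth_vals into a {channel: value} mapping."""
    if isinstance(smooth_vals, int):
        # A single int applies to every channel
        return {channel: smooth_vals for channel in channels}
    if len(smooth_vals) != len(channels):
        raise ValueError("smooth_vals must have one entry per channel")
    return dict(zip(channels, smooth_vals))


shared = resolve_smooth_vals(["CD4", "CD8"], 3)
custom = resolve_smooth_vals(["CD4", "CD8"], [2, 5])
```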
ark.phenotyping.pixel_meta_clustering¶
- ark.phenotyping.pixel_meta_clustering.apply_pixel_meta_cluster_remapping(fovs, channels, base_dir, pixel_data_dir, pixel_remapped_name, multiprocess=False, batch_size=5)[source]¶
Apply the meta cluster remapping to the data in
pixel_data_dir
.
- Parameters:
fovs (list) – The list of fovs to subset on
channels (list) – The list of channels to subset on
base_dir (str) – The path to the data directories
pixel_data_dir (str) – Name of directory with the full pixel data. This data should also have the SOM cluster labels appended from
cluster_pixels
and the meta cluster labels appended from pixel_consensus_cluster
.
pixel_remapped_name (str) – Name of the file containing the pixel SOM clusters to their remapped meta clusters
multiprocess (bool) – Whether to use multiprocessing or not
batch_size (int) – The number of FOVs to process in parallel
- ark.phenotyping.pixel_meta_clustering.generate_meta_avg_files(fovs, channels, base_dir, pixel_cc, data_dir='pixel_mat_data', pc_chan_avg_som_cluster_name='pixel_channel_avg_som_cluster.csv', pc_chan_avg_meta_cluster_name='pixel_channel_avg_meta_cluster.csv', num_fovs_subset=100, seed=42, overwrite=False)[source]¶
Computes and saves the average channel expression across pixel meta clusters. Assigns meta cluster labels to the data stored in
pc_chan_avg_som_cluster_name
.
- Parameters:
fovs (list) – The list of fovs to subset on
channels (list) – The list of channels to subset on
base_dir (str) – The path to the data directory
pixel_cc (cluster_helpers.PixieConsensusCluster) – The consensus cluster object containing the SOM to meta mapping
data_dir (str) – Name of the directory which contains the full preprocessed pixel data. This data should also have the SOM cluster labels appended from
cluster_pixels
.
pc_chan_avg_som_cluster_name (str) – Name of file to save the channel-averaged results across all SOM clusters to
pc_chan_avg_meta_cluster_name (str) – Name of file to save the channel-averaged results across all meta clusters to
num_fovs_subset (float) – The number of FOVs to subset on for meta cluster channel averaging
seed (int) – The random seed to use for subsetting FOVs
overwrite (bool) – If set, force overwrites the existing average channel expression file if it exists
- ark.phenotyping.pixel_meta_clustering.generate_remap_avg_files(fovs, channels, base_dir, pixel_data_dir, pixel_remapped_name, pc_chan_avg_som_cluster_name, pc_chan_avg_meta_cluster_name, num_fovs_subset=100, seed=42)[source]¶
Resaves the re-mapped consensus data to
pixel_data_dir
and re-runs the average channel expression per pixel meta cluster computation. Re-maps the pixel SOM clusters to meta clusters in
pc_chan_avg_som_cluster_name
.
- Parameters:
fovs (list) – The list of fovs to subset on
channels (list) – The list of channels to subset on
base_dir (str) – The path to the data directories
pixel_data_dir (str) – Name of directory with the full pixel data. This data should also have the SOM cluster labels appended from
cluster_pixels
and the meta cluster labels appended from pixel_consensus_cluster
.
pixel_remapped_name (str) – Name of the file containing the pixel SOM clusters to their remapped meta clusters
pc_chan_avg_som_cluster_name (str) – Name of the file containing the channel-averaged results across all SOM clusters
pc_chan_avg_meta_cluster_name (str) – Name of the file containing the channel-averaged results across all meta clusters
num_fovs_subset (float) – The number of FOVs to subset on for meta cluster channel averaging
seed (int) – The random seed to use for subsetting FOVs
- ark.phenotyping.pixel_meta_clustering.pixel_consensus_cluster(fovs, channels, base_dir, max_k=20, cap=3, data_dir='pixel_mat_data', pc_chan_avg_som_cluster_name='pixel_channel_avg_som_cluster.csv', multiprocess=False, batch_size=5, seed=42, overwrite=False)[source]¶
Run consensus clustering algorithm on pixel-level summed data across channels. Saves data with consensus cluster labels to
data_dir
.
- Parameters:
fovs (list) – The list of fovs to subset on
channels (list) – The list of channels to subset on
base_dir (str) – The path to the data directory
max_k (int) – The number of consensus clusters
cap (int) – z-score cap to use when running hierarchical clustering
data_dir (str) – Name of the directory which contains the full preprocessed pixel data. This data should also have the SOM cluster labels appended from
cluster_pixels
.
pc_chan_avg_som_cluster_name (str) – Name of file to save the channel-averaged results across all SOM clusters to
multiprocess (bool) – Whether to use multiprocessing or not
batch_size (int) – The number of FOVs to process in parallel, ignored if
multiprocess
is False
seed (int) – The random seed to set for consensus clustering
overwrite (bool) – If set, force overwrites the meta labels in all the FOVs
- Returns:
The consensus cluster object containing the SOM to meta mapping
- Return type:
- ark.phenotyping.pixel_meta_clustering.run_pixel_consensus_assignment(pixel_data_path, pixel_cc_obj, fov)[source]¶
Helper function to assign pixel consensus clusters
- Parameters:
pixel_data_path (str) – The path to the pixel data directory
pixel_cc_obj (ark.phenotyping.cluster_helpers.PixieConsensusCluster) – The pixel consensus cluster object
fov (str) – The name of the FOV to process
- Returns:
The name of the FOV as well as the return code
- Return type:
- ark.phenotyping.pixel_meta_clustering.update_pixel_meta_labels(pixel_data_path, pixel_remapped_dict, pixel_renamed_meta_dict, fov)[source]¶
Helper function to reassign meta cluster names based on remapping scheme to a FOV
- Parameters:
pixel_data_path (str) – The path to the pixel data directory
pixel_remapped_dict (dict) – The mapping from pixel SOM cluster to pixel meta cluster label (not renamed)
pixel_renamed_meta_dict (dict) – The mapping from pixel meta cluster label to renamed pixel meta cluster name
fov (str) – The name of the FOV to process
- Returns:
The name of the FOV as well as the return code
- Return type:
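The two dictionaries passed to update_pixel_meta_labels form a two-step lookup: SOM cluster to meta cluster, then meta cluster to its renamed label. The sketch below applies that chain to in-memory rows. The function name and example values are hypothetical, and the library helper reads and rewrites per-FOV files instead.

```python
def relabel_pixels(rows, som_to_meta, meta_to_name):
    """Apply SOM -> meta -> renamed meta lookups to every pixel row."""
    updated = []
    for row in rows:
        meta = som_to_meta[row["pixel_som_cluster"]]
        updated.append(
            {
                **row,
                "pixel_meta_cluster": meta,
                "pixel_meta_cluster_rename": meta_to_name[meta],
            }
        )
    return updated


som_to_meta = {1: 1, 2: 1, 3: 2}         # hypothetical remapping
meta_to_name = {1: "tumor", 2: "immune"}  # hypothetical renamed labels
pixels = [{"pixel_som_cluster": 2}, {"pixel_som_cluster": 3}]
relabeled = relabel_pixels(pixels, som_to_meta, meta_to_name)
```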
ark.phenotyping.pixel_som_clustering¶
- ark.phenotyping.pixel_som_clustering.cluster_pixels(fovs, base_dir, pixel_pysom, data_dir='pixel_mat_data', multiprocess=False, batch_size=5, num_parallel_pixels=1000000, overwrite=False)[source]¶
Uses trained SOM weights to assign cluster labels on full pixel data.
Saves data with cluster labels to
data_dir
.
- Parameters:
fovs (list) – The list of fovs to subset on
base_dir (str) – The path to the data directory
pixel_pysom (cluster_helpers.PixelSOMCluster) – The SOM cluster object containing the pixel SOM weights
data_dir (str) – Name of the directory which contains the full preprocessed pixel data
multiprocess (bool) – Whether to use multiprocessing or not
batch_size (int) – The number of FOVs to process in parallel, ignored if
multiprocess
is False
num_parallel_pixels (int) – How many pixels to label in parallel at once for each FOV
overwrite (bool) – If set, force overwrite the SOM labels in all the FOVs
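When multiprocess is enabled, batch_size controls how many FOVs are handled in parallel. One simple way to picture this is splitting the FOV list into consecutive batches, as in the hypothetical helper below; how the library actually schedules its workers is not described in this reference.

```python
def batch_fovs(fovs, batch_size=5):
    """Split the FOV list into consecutive batches of at most batch_size."""
    return [fovs[i:i + batch_size] for i in range(0, len(fovs), batch_size)]


fov_names = [f"fov{i}" for i in range(12)]
batches = batch_fovs(fov_names, batch_size=5)
```

The final batch is simply shorter when the number of FOVs is not a multiple of batch_size.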
- ark.phenotyping.pixel_som_clustering.generate_som_avg_files(fovs, channels, base_dir, pixel_pysom, data_dir='pixel_data_dir', pc_chan_avg_som_cluster_name='pixel_channel_avg_som_cluster.csv', num_fovs_subset=100, require_all_som_clusters=True, seed=42, overwrite=False)[source]¶
Computes and saves the average channel expression across pixel SOM clusters.
- Parameters:
fovs (list) – The list of fovs to subset on
channels (list) – The list of channels to subset on
base_dir (str) – The path to the data directory
pixel_pysom (cluster_helpers.PixelSOMCluster) – The SOM cluster object containing the pixel SOM weights
data_dir (str) – Name of the directory which contains the full preprocessed pixel data
pc_chan_avg_som_cluster_name (str) – The name of the file to save the average channel expression across all SOM clusters
num_fovs_subset (int) – The number of FOVs to subset on for SOM cluster channel averaging
require_all_som_clusters (bool) – Whether to require all SOM clusters to have at least one pixel assigned
seed (int) – The random seed to set for subsetting FOVs
overwrite (bool) – If set, force overwrite the existing average channel expression file if it exists
- ark.phenotyping.pixel_som_clustering.run_pixel_som_assignment(pixel_data_path, pixel_pysom_obj, overwrite, num_parallel_pixels, fov)[source]¶
Helper function to assign pixel SOM cluster labels
- Parameters:
pixel_data_path (str) – The path to the pixel data directory
pixel_pysom_obj (ark.phenotyping.cluster_helpers.PixelSOMCluster) – The pixel SOM cluster object
overwrite (bool) – Whether to overwrite the pixel SOM clusters or not
num_parallel_pixels (int) – How many pixels to label in parallel at once for each FOV
fov (str) – The name of the FOV to process
- Returns:
The name of the FOV as well as the return code
- Return type:
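Conceptually, assigning SOM labels is a nearest-node lookup done in chunks. The sketch below is not ark's implementation: it assumes Euclidean distance, a plain NumPy weights array, and 1-based labels, and the helper name `assign_som_labels` is hypothetical. Here `num_parallel_pixels` only controls the chunk size.

```python
import numpy as np


def assign_som_labels(pixels, weights, num_parallel_pixels=1000):
    """Label each pixel with the index of its nearest SOM node.

    pixels:  (n_pixels, n_channels) array of preprocessed pixel values
    weights: (n_nodes, n_channels) array of SOM node weights
    """
    labels = np.empty(pixels.shape[0], dtype=int)
    # Work through the pixels in chunks to bound memory use
    for start in range(0, pixels.shape[0], num_parallel_pixels):
        chunk = pixels[start:start + num_parallel_pixels]
        # Squared Euclidean distance from every pixel in the chunk to every node
        dists = ((chunk[:, None, :] - weights[None, :, :]) ** 2).sum(axis=2)
        # 1-based labels are an assumption made for this sketch
        labels[start:start + num_parallel_pixels] = dists.argmin(axis=1) + 1
    return labels


weights = np.array([[0.0, 0.0], [1.0, 1.0]])
pixels = np.array([[0.1, 0.0], [0.9, 1.0], [0.2, 0.1]])
som_labels = assign_som_labels(pixels, weights, num_parallel_pixels=2)
```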
- ark.phenotyping.pixel_som_clustering.train_pixel_som(fovs, channels, base_dir, subset_dir='pixel_mat_subsetted', norm_vals_name='post_rowsum_chan_norm.feather', som_weights_name='pixel_som_weights.feather', xdim=10, ydim=10, lr_start=0.05, lr_end=0.01, num_passes=1, seed=42, overwrite=False)[source]¶
Run the SOM training on the subsetted pixel data.
Saves SOM weights to base_dir/som_weights_name.
- Parameters:
fovs (list) – The list of fovs to subset on
channels (list) – The list of markers to subset on
base_dir (str) – The path to the data directories
subset_dir (str) – The name of the subsetted data directory
norm_vals_name (str) – The name of the file to store the 99.9% normalization values
som_weights_name (str) – The name of the file to save the SOM weights to
xdim (int) – The number of x nodes to use for the SOM
ydim (int) – The number of y nodes to use for the SOM
lr_start (float) – The start learning rate for the SOM, decays to lr_end
lr_end (float) – The end learning rate for the SOM, decays from lr_start
num_passes (int) – The number of training passes to make through the dataset
seed (int) – The random seed to use for training the SOM
overwrite (bool) – If set, force retrains the SOM and overwrites the weights
- Returns:
The SOM cluster object containing the pixel SOM weights
- Return type:
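To make the roles of xdim, ydim, lr_start, lr_end, and num_passes concrete, here is a small online SOM written with NumPy. It is a sketch under assumptions, not the library's training code: the linear learning-rate decay, the fixed Gaussian neighborhood `radius`, the data-row initialization, and the function name `train_som` are all choices made for this illustration.

```python
import numpy as np


def train_som(data, xdim=10, ydim=10, lr_start=0.05, lr_end=0.01,
              num_passes=1, radius=1.0, seed=42):
    """Train a small online SOM; returns weights of shape (xdim * ydim, n_channels)."""
    rng = np.random.default_rng(seed)
    n_nodes = xdim * ydim
    # Initialise node weights from randomly chosen data rows
    weights = data[rng.integers(0, data.shape[0], size=n_nodes)].astype(float)
    # Grid coordinates of each node, used by the neighbourhood function
    grid = np.array([(x, y) for x in range(xdim) for y in range(ydim)], dtype=float)
    total_steps = num_passes * data.shape[0]
    # Assumed schedule: learning rate falls linearly from lr_start to lr_end
    lrs = np.linspace(lr_start, lr_end, total_steps)
    step = 0
    for _ in range(num_passes):
        for row in data[rng.permutation(data.shape[0])]:
            # Best matching unit: the node whose weights are closest to this row
            bmu = ((weights - row) ** 2).sum(axis=1).argmin()
            # Gaussian neighbourhood on the grid around the best matching unit
            grid_dist = ((grid - grid[bmu]) ** 2).sum(axis=1)
            influence = np.exp(-grid_dist / (2 * radius ** 2))
            weights += lrs[step] * influence[:, None] * (row - weights)
            step += 1
    return weights


data = np.random.default_rng(0).random((200, 3))
som_weights = train_som(data, xdim=3, ydim=3)
```

With xdim=3 and ydim=3 the result has nine nodes, one row of weights per node and one column per channel.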
ark.phenotyping.pixie_preprocessing¶
- ark.phenotyping.pixie_preprocessing.create_fov_pixel_data(fov, channels, img_data, seg_labels, pixel_thresh_val, blur_factor=2, subset_proportion=0.1)[source]¶
Preprocess pixel data for one fov
- Parameters:
fov (str) – Name of the fov to index
channels (list) – List of channels to subset over
img_data (numpy.ndarray) – Array representing image data for one fov
seg_labels (numpy.ndarray) – Array representing segmentation labels for one fov
pixel_thresh_val (float) – value used to determine per-pixel cutoff for total signal inclusion
blur_factor (int) – The sigma to set for the Gaussian blur
subset_proportion (float) – The proportion of pixels to take from each fov
- Returns:
Contains the following:
pandas.DataFrame: Gaussian blurred and channel sum normalized pixel data for a fov
pandas.DataFrame: subset of the preprocessed pixel dataset for a fov
- Return type:
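The per-FOV steps described above (blur, total-signal cutoff, per-pixel normalization, subsetting) can be sketched with NumPy and pandas. This is an illustration under assumptions rather than the library's code: the helper names, the zero-padded separable blur, the strict `>` comparison against `pixel_thresh_val`, and the marker names are all hypothetical, and segmentation labels are left out.

```python
import numpy as np
import pandas as pd


def gaussian_kernel(sigma):
    """1-D Gaussian kernel truncated at three standard deviations."""
    radius = int(3 * sigma)
    x = np.arange(-radius, radius + 1)
    kernel = np.exp(-(x ** 2) / (2 * sigma ** 2))
    return kernel / kernel.sum()


def blur_channel(img, sigma):
    """Separable Gaussian blur; each image side must be at least the kernel length."""
    kernel = gaussian_kernel(sigma)
    # Convolve along rows, then along columns (edges are implicitly zero-padded)
    rows = np.apply_along_axis(np.convolve, 1, img, kernel, mode="same")
    return np.apply_along_axis(np.convolve, 0, rows, kernel, mode="same")


def sketch_fov_pixel_data(img_data, channels, pixel_thresh_val,
                          blur_factor=2, subset_proportion=0.1, seed=42):
    """img_data: (rows, cols, n_channels) array for one FOV."""
    blurred = np.stack(
        [blur_channel(img_data[..., i], blur_factor) for i in range(len(channels))],
        axis=-1,
    )
    flat = pd.DataFrame(blurred.reshape(-1, len(channels)), columns=channels)
    # Keep only pixels whose total signal exceeds the cutoff (assumed comparison)
    flat = flat[flat.sum(axis=1) > pixel_thresh_val]
    # Normalise each pixel so that its channel values sum to 1
    flat = flat.div(flat.sum(axis=1), axis=0)
    # Random subset of the preprocessed pixels
    subset = flat.sample(frac=subset_proportion, random_state=seed)
    return flat, subset


img = np.random.default_rng(0).random((16, 16, 2))
fov_data, fov_subset = sketch_fov_pixel_data(img, ["CD3", "CD8"], pixel_thresh_val=0.05)
```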
- ark.phenotyping.pixie_preprocessing.create_pixel_matrix(fovs, channels, base_dir, tiff_dir, seg_dir, img_sub_folder='TIFs', seg_suffix='_whole_cell.tiff', pixel_output_dir='pixel_output_dir', data_dir='pixel_mat_data', subset_dir='pixel_mat_subsetted', norm_vals_name_pre_rownorm='channel_norm_pre_rownorm.feather', norm_vals_name_post_rownorm='channel_norm_post_rownorm.feather', pixel_thresh_name='pixel_thresh.feather', channel_percentile_pre_rownorm=0.99, channel_percentile_post_rownorm=0.999, is_mibitiff=False, blur_factor=2, subset_proportion=0.1, seed=42, multiprocess=False, batch_size=5)[source]¶
For each fov, add a Gaussian blur to each channel and normalize channel sums for each pixel.
Saves data to data_dir and subsetted data to subset_dir.
- Parameters:
fovs (list) – List of fovs to subset over
channels (list) – List of channels to subset over, applies only to pixel_mat_subset
base_dir (str) – The path to the data directories
tiff_dir (str) – Name of the directory containing the tiff files
seg_dir (str) – Name of the directory containing the segmented files. Set to None if no segmentation directory is available or desired.
img_sub_folder (str) – Name of the subdirectory inside tiff_dir containing the tiff files. Set to None if there isn’t any.
seg_suffix (str) – The suffix that the segmentation images use. Ignored if seg_dir is None.
pixel_output_dir (str) – The name of the data directory containing the pixel data to use for the clustering pipeline. data_dir and subset_dir should be placed here.
data_dir (str) – Name of the directory which contains the full preprocessed pixel data. Should be placed in pixel_output_dir.
subset_dir (str) – The name of the directory containing the subsetted pixel data. Should be placed in pixel_output_dir.
norm_vals_name_pre_rownorm (str) – The name of the file to store the pre-pixel-normalized norm values
norm_vals_name_post_rownorm (str) – The name of the file to store the post-pixel-normalized norm values
pixel_thresh_name (str) – The name of the file to store the pixel threshold value
channel_percentile_pre_rownorm (float) – Percentile used to normalize channels before pixel normalization
channel_percentile_post_rownorm (float) – Percentile used to normalize channels after pixel normalization
is_mibitiff (bool) – Whether to load the images from MIBITiff
blur_factor (int) – The sigma to set for the Gaussian blur
subset_proportion (float) – The proportion of pixels to take from each fov
seed (int) – The random seed to set for subsetting
multiprocess (bool) – Whether to use multiprocessing or not
batch_size (int) – The number of FOVs to process in parallel, ignored if multiprocess is False
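One plausible reading of the channel percentile parameters is that each channel is divided by its value at the given percentile. The sketch below illustrates that reading only; it is an assumption about the normalization, not a description of what create_pixel_matrix does internally, and the helper name `normalize_by_percentile` and the marker names are hypothetical.

```python
import pandas as pd


def normalize_by_percentile(pixel_data, channels, percentile=0.99):
    """Divide each channel by its value at the given percentile."""
    # One normalization value per channel, indexed by channel name
    norm_vals = pixel_data[channels].quantile(percentile)
    normalized = pixel_data.copy()
    # Column-wise division: each channel is scaled by its own norm value
    normalized[channels] = pixel_data[channels] / norm_vals
    return normalized, norm_vals


# Toy data: 101 evenly spaced values per channel
pixel_data = pd.DataFrame({
    "CD3": [float(i) for i in range(101)],
    "CD8": [2.0 * i for i in range(101)],
})
normalized, norm_vals = normalize_by_percentile(pixel_data, ["CD3", "CD8"])
```

Scaling by a high percentile rather than the maximum keeps a few extreme pixels from compressing the rest of the channel's range.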
- ark.phenotyping.pixie_preprocessing.preprocess_fov(base_dir, tiff_dir, data_dir, subset_dir, seg_dir, seg_suffix, img_sub_folder, is_mibitiff, channels, blur_factor, subset_proportion, pixel_thresh_val, seed, channel_norm_df, fov)[source]¶
Helper function to read in the FOV-level pixel data, run create_fov_pixel_data, and save the preprocessed data.
- Parameters:
base_dir (str) – The path to the data directories
tiff_dir (str) – Name of the directory containing the tiff files
data_dir (str) – Name of the directory which contains the full preprocessed pixel data
subset_dir (str) – The name of the directory containing the subsetted pixel data
seg_dir (str) – Name of the directory containing the segmented files. Set to None if no segmentation directory is available or desired.
seg_suffix (str) – The suffix that the segmentation images use. Ignored if seg_dir is None.
img_sub_folder (str) – Name of the subdirectory inside tiff_dir containing the tiff files. Set to None if there isn’t any.
is_mibitiff (bool) – Whether to load the images from MIBITiff
channels (list) – List of channels to subset over, applies only to pixel_mat_subset
blur_factor (int) – The sigma to set for the Gaussian blur
subset_proportion (float) – The proportion of pixels to take from each fov
pixel_thresh_val (float) – The value to normalize the pixels by
seed (int) – The random seed to set for subsetting
channel_norm_df (pandas.DataFrame) – The channel normalization values to use
fov (str) – The name of the FOV to preprocess
- Returns:
The full preprocessed pixel dataset, needed for computing 99.9% normalized values in create_pixel_matrix
- Return type:
ark.phenotyping.post_cluster_utils¶
- ark.phenotyping.post_cluster_utils.create_mantis_project(cell_table: pandas.DataFrame, fovs: List[str], seg_dir: Union[Path, str], mask_dir: Union[Path, str], image_dir: Union[Path, str], mantis_dir: Union[Path, str], pop_col: str = 'cell_meta_cluster', fov_col: str = 'fov', label_col: str = 'label', seg_suffix_name: str = '_whole_cell.tiff') None [source]¶
Creates a complete Mantis Project for viewing cell labels.
- Parameters:
cell_table (pd.DataFrame) – DataFrame of extracted cell features and subtypes.
fovs (List[str]) – A list of FOVs to use for creating the project.
seg_dir (Union[pathlib.Path, str]) – The path to the directory containing the segmentation images.
mask_dir (Union[pathlib.Path, str]) – The path to the directory where the masks will be stored.
image_dir (Union[pathlib.Path, str]) – The path to the directory containing the raw image data.
mantis_dir (Union[pathlib.Path, str]) – The path to the directory where the mantis project will be created.
pop_col (str, optional) – The column name containing the distinct cell populations. Defaults to settings.CELL_TYPE ("cell_meta_cluster").
fov_col (str, optional) – The column name containing the FOV IDs. Defaults to settings.FOV_ID ("fov").
label_col (str, optional) – The column name containing the cell label. Defaults to settings.CELL_LABEL ("label").
seg_suffix_name (str, optional) – The suffix of the segmentation file and its file extension. Defaults to "_whole_cell.tiff".
- ark.phenotyping.post_cluster_utils.generate_new_cluster_resolution(cell_table, cluster_col, new_cluster_col, cluster_mapping, save_path)[source]¶
Add a new column of broader cell cluster assignments to the cell table.
- Parameters:
cell_table (pd.DataFrame) – cell table with clustered cell populations
cluster_col (str) – column containing the cell phenotype
new_cluster_col (str) – new column to create
cluster_mapping (dict) – dictionary with keys detailing the new cluster names and values explaining which cell types to group together
save_path (str) – where to save the new cell table
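The shape of cluster_mapping is easiest to see with a small example. The sketch below is a hedged illustration rather than the library's code: the helper name `add_broad_clusters`, the cell type names, the new column name, and the choice to raise an error on unmapped cell types are assumptions, and saving to save_path is omitted.

```python
import pandas as pd


def add_broad_clusters(cell_table, cluster_col, new_cluster_col, cluster_mapping):
    """Add a column that groups fine cell types into broader clusters."""
    # Invert {new_name: [old_names]} into {old_name: new_name}
    lookup = {old: new for new, olds in cluster_mapping.items() for old in olds}
    missing = set(cell_table[cluster_col]) - set(lookup)
    if missing:
        # Assumed behaviour for this sketch: refuse to leave cells unassigned
        raise ValueError(f"Unmapped cell types: {sorted(missing)}")
    out = cell_table.copy()
    out[new_cluster_col] = out[cluster_col].map(lookup)
    return out


cell_table = pd.DataFrame({"cell_meta_cluster": ["CD4T", "CD8T", "B"]})
cluster_mapping = {"T cell": ["CD4T", "CD8T"], "B cell": ["B"]}
broad_table = add_broad_clusters(
    cell_table, "cell_meta_cluster", "cell_cluster_broad", cluster_mapping
)
```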
- ark.phenotyping.post_cluster_utils.plot_hist_thresholds(cell_table, populations, marker, pop_col='cell_meta_cluster', threshold=None, percentile=0.999)[source]¶
Create histograms to compare marker distributions across cell populations
- Parameters:
cell_table (pd.DataFrame) – cell table with clustered cell populations
populations (list) – populations to plot as stacked histograms
marker (str) – the marker used to generate the histograms
pop_col (str) – the column containing the names of the cell populations
threshold (float, None) – optional value to plot a horizontal line for visualization
percentile (float) – cap used to control x axis limits of the plot
ark.phenotyping.weighted_channel_comp¶
- ark.phenotyping.weighted_channel_comp.compute_cell_cluster_weighted_channel_avg(fovs, channels, base_dir, weighted_cell_channel_name, cell_cluster_data, cell_cluster_col='cell_meta_cluster')[source]¶
Computes the average weighted marker expression for each cell cluster
- Parameters:
fovs (list) – The list of fovs to subset on
channels (list) – The list of channels to subset on
base_dir (str) – The path to the data directory
weighted_cell_channel_name (str) – The name of the weighted cell table, created in 3_Pixie_Cluster_Cells.ipynb
cell_cluster_data (pandas.DataFrame) – The cell data with cluster labels
cell_cluster_col (str) – Whether to aggregate by cell SOM or meta labels. Needs to be either ‘cell_som_cluster’ or ‘cell_meta_cluster’
- Returns:
Each cell cluster mapped to the average expression for each marker
- Return type:
- ark.phenotyping.weighted_channel_comp.compute_p2c_weighted_channel_avg(pixel_channel_avg, channels, cell_counts, fovs=None, pixel_cluster_col='pixel_meta_cluster_rename')[source]¶
Compute the average marker expression for each cell weighted by pixel cluster
This expression is weighted by the pixel SOM/meta cluster counts. For each cell, the marker expression vector is computed as:
pixel_cluster_n_count * avg_marker_exp_pixel_cluster_n + ...
These values are then normalized by the cell’s respective size.
Note that this function will only be used to correct overlapping signal for visualization.
- Parameters:
pixel_channel_avg (pandas.DataFrame) – The average channel values for each pixel SOM/meta cluster. Computed by compute_pixel_cluster_channel_avg
channels (list) – The list of channels to subset pixel_channel_avg by
cell_counts (pandas.DataFrame) – The dataframe listing the number of each type of pixel SOM/meta cluster per cell
fovs (list) – The list of fovs to include, if None provided all are used
pixel_cluster_col (str) – Name of the pixel cluster column to group by. Should be 'pixel_som_cluster' or 'pixel_meta_cluster_rename'
- Returns:
Returns the average marker expression for each cell in the dataset
- Return type:
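The weighted sum described above amounts to a matrix product between per-cell pixel cluster counts and per-cluster channel averages, followed by division by cell size. The sketch below shows that arithmetic only. The table layout (one count column per pixel cluster, a `cell_size` column, cluster names as the index of the averages table), the helper name, and the marker names are assumptions made for the example, not the library's actual schema.

```python
import pandas as pd


def weighted_channel_avg(cell_counts, pixel_channel_avg, channels,
                         cluster_cols, size_col="cell_size"):
    """Per-cell marker expression weighted by pixel cluster counts."""
    counts = cell_counts[cluster_cols].to_numpy(dtype=float)
    # Rows follow the same cluster order as the count columns
    avgs = pixel_channel_avg.loc[cluster_cols, channels].to_numpy(dtype=float)
    # sum_n(pixel_cluster_n_count * avg_marker_exp_pixel_cluster_n) for every cell
    weighted = counts @ avgs
    # Normalise by each cell's size
    weighted = weighted / cell_counts[size_col].to_numpy(dtype=float)[:, None]
    return pd.DataFrame(weighted, columns=channels, index=cell_counts.index)


pixel_channel_avg = pd.DataFrame(
    {"CD3": [1.0, 0.0], "CD8": [0.0, 2.0]},
    index=["cluster_1", "cluster_2"],
)
cell_counts = pd.DataFrame({
    "cluster_1": [3, 0],
    "cluster_2": [1, 4],
    "cell_size": [4, 4],
})
cell_expr = weighted_channel_avg(
    cell_counts, pixel_channel_avg, ["CD3", "CD8"], ["cluster_1", "cluster_2"]
)
```

For the first cell, CD3 is (3 * 1.0 + 1 * 0.0) / 4 = 0.75 and CD8 is (3 * 0.0 + 1 * 2.0) / 4 = 0.5.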
- ark.phenotyping.weighted_channel_comp.generate_remap_avg_wc_files(fovs, channels, base_dir, cell_som_input_data, cell_remapped_name, weighted_cell_channel_name, cell_som_cluster_channel_avg_name, cell_meta_cluster_channel_avg_name)[source]¶
Apply the cell cluster remapping to the average weighted channel files
- Parameters:
fovs (list) – The list of fovs to subset on
channels (list) – The list of channels to subset on
base_dir (str) – The path to the data directory
cell_som_input_data (pandas.DataFrame) – The input data used for SOM training. For weighted channel averaging, this should contain the number of pixel SOM/meta cluster counts of each cell, normalized by cell_size.
cell_remapped_name (str) – Name of the file containing the cell SOM clusters to their remapped meta clusters
weighted_cell_channel_name (str) – The name of the file containing the weighted channel expression table
cell_som_cluster_channel_avg_name (str) – The name of the file to save the average weighted channel expression per cell SOM cluster
cell_meta_cluster_channel_avg_name (str) – Same as above except for cell meta clusters
- ark.phenotyping.weighted_channel_comp.generate_wc_avg_files(fovs, channels, base_dir, cell_cc, cell_som_input_data, weighted_cell_channel_name='weighted_cell_channel.feather', cell_som_cluster_channel_avg_name='cell_som_cluster_channel_avg.csv', cell_meta_cluster_channel_avg_name='cell_meta_cluster_channel_avg.csv', overwrite=False)[source]¶
Generate the weighted channel average files per cell SOM and meta clusters.
When running cell clustering with pixel clusters generated from Pixie, the counts of each pixel cluster per cell are computed. These are multiplied by the average expression profile of each pixel cluster to determine the weighted channel average. This computation is averaged by both cell SOM and meta cluster.
- Parameters:
fovs (list) – The list of fovs to subset on
channels (list) – The list of channels to subset on
base_dir (str) – The path to the data directory
cell_cc (cluster_helpers.PixieConsensusCluster) – The consensus cluster object containing the SOM to meta mapping
cell_som_input_data (str) – The input data used for SOM training. For weighted channel averaging, it should contain the number of pixel SOM/meta cluster counts of each cell, normalized by cell_size.
weighted_cell_channel_name (str) – The name of the file containing the weighted channel expression table
cell_som_cluster_channel_avg_name (str) – The name of the file to save the average weighted channel expression per cell SOM cluster
cell_meta_cluster_channel_avg_name (str) – Same as above except for cell meta clusters
overwrite (bool) – If set, regenerate average weighted channel expression for SOM and meta clusters
- ark.phenotyping.weighted_channel_comp.generate_weighted_channel_avg_heatmap(cell_cluster_channel_avg_path, cell_cluster_col, channels, raw_cmap, renamed_cmap, center_val=0, min_val=-3, max_val=3)[source]¶
Generates a z-scored heatmap of the average weighted channel expression per cell cluster
- Parameters:
cell_cluster_channel_avg_path (str) – Path to the file containing the average weighted channel expression per cell cluster
cell_cluster_col (str) – The name of the cell cluster col, needs to be either ‘cell_som_cluster’ or ‘cell_meta_cluster_rename’
channels (list) – The list of channels to visualize
raw_cmap (dict) – Maps the raw meta cluster labels to their respective colors, created by generate_meta_cluster_colormap_dict
renamed_cmap (dict) – Maps the renamed meta cluster labels to their respective colors, created by generate_meta_cluster_colormap_dict
center_val (float) – value at which to center the heatmap
min_val (float) – minimum value the heatmap should take
max_val (float) – maximum value the heatmap should take
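The data preparation behind a z-scored heatmap can be sketched without any plotting. The example below assumes z-scoring is done per channel across clusters with a population standard deviation, and that min_val and max_val act as clipping bounds; both are assumptions for illustration, and the helper name `zscore_and_clip` and the marker names are hypothetical.

```python
import pandas as pd


def zscore_and_clip(cluster_avg, channels, min_val=-3, max_val=3):
    """Z-score each channel across clusters, then clip to [min_val, max_val]."""
    data = cluster_avg[channels]
    # Column-wise z-score; ddof=0 (population std) is an assumption of this sketch
    z = (data - data.mean(axis=0)) / data.std(axis=0, ddof=0)
    return z.clip(lower=min_val, upper=max_val)


cluster_avg = pd.DataFrame({
    "cell_meta_cluster": [1, 2, 3],
    "CD3": [0.0, 1.0, 2.0],
    "CD8": [2.0, 4.0, 6.0],
})
# Narrow bounds are used here only to make the clipping visible
z_clipped = zscore_and_clip(cluster_avg, ["CD3", "CD8"], min_val=-1, max_val=1)
```

A channel that is constant across clusters has zero standard deviation and would produce missing values, so such channels need separate handling before plotting.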