Information for Developers¶
If you wish to do higher-level development on top of `ark`, we recommend setting up a virtual environment. We highly recommend using `conda` virtual environments. To set one up, you will need to install Anaconda.
Setting up Your Virtual Environment - Anaconda¶
Installing Anaconda¶
For a step-by-step guide to installing Anaconda, please refer to these links:
https://docs.anaconda.com/anaconda/install/mac-os/ for Mac (x86_64 / Intel) users
https://github.com/conda-forge/miniforge/releases for Mac (arm64 / Apple Silicon) users
https://docs.anaconda.com/anaconda/install/windows/ for Windows users
Notes for Mac users
We recommend following the command line installer instructions as users have reported recent issues with the graphical installer.
To test if `conda` has been added to your path, run `conda info` in your Terminal. If you get an error message, `conda` has not been added to your `PATH` environment variable yet. To fix this, run `export PATH="/Users/yourname/anaconda3/bin:$PATH"`.
Apple Silicon Installation
You will need to install Miniforge first. Miniforge contains `conda` with native Apple Silicon support. There are a few installation options available, all of which generally work the same way. Consult the Miniforge documentation if you wish to read about them (using Mamba vs. Conda, for example).
Getting Miniforge
Option 1 (recommended): install via Homebrew:
brew install miniforge
Option 2: download the installer script and run it from the terminal:
bash Miniforge3-$(uname)-$(uname -m).sh
Then initialize conda for your shell:
conda init
Creating a virtual environment¶
Now that Anaconda is installed, you can create a `conda` environment. To do so, on your command line, type `conda create -n <my_env> python=3.8`, where `<my_env>` is a name you set. Our codebase currently supports up to Python 3.8. Say yes to any prompts and your `conda` environment will be created!
To verify installation, activate your `conda` environment with `conda activate <my_env>`. If you see `(<my_env>)` on the far left of the command prompt, you have successfully created and activated your environment. Type `conda deactivate` to exit at any time.
Setting up ark-analysis for development¶
`ark` relies on several other Python packages. Inside the `ark-analysis` repo (if you don't have it, first run `git clone https://github.com/angelolab/ark-analysis.git`), and with your virtual environment activated, install these dependencies by running `pip install -e ".[test]"`. This installs `ark` along with its runtime and testing dependencies.
You're now set to start working with `ark-analysis`! Please look at our contributing guidelines for more information about development. For detailed explanations of the functions available to you in `ark`, please consult the Libraries section of this documentation.
Updating Ark Analysis in the Docker¶
Note that code changes aren't automatically propagated into the Docker image. However, there may be times when you would like to work with and test out new changes and features. You can update the version of `ark-analysis` inside the container by running the following commands in the Jupyter Lab terminal.
cd /opt/ark-analysis
git pull
pip install .
Using ark functions directly¶
If you will only be using functions in `ark` without developing on top of it, do not clone the repo. Simply run `pip install ark-analysis` inside the virtual environment to gain access to our functions. To verify installation, type `conda list ark-analysis` after completion. If `ark-analysis` is listed, the installation was successful. You can now access the `ark` library with `import ark`.
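As a quick sanity check, the following should run without errors in the activated environment (printing the version assumes the package exposes a `__version__` attribute, which is common but not guaranteed):
import ark
print(ark.__version__)  # assumed attribute; a successful import alone confirms installation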
Developing template notebooks via Docker¶
If you are using Docker for your virtual environment, and plan to develop and commit template notebooks, you should use the `--develop-notebook-templates` flag for `start_docker.sh`.
Typically, the `./templates` folder is copied into `./scripts` before starting Docker, and Jupyter is started within `./scripts`. This enables users of `ark-analysis` to use the notebooks without dirtying the git working directory (doing so would cause merge conflicts on pull). When using `--develop-notebook-templates`, `./templates` is used directly, so changes are reflected immediately.
To enable this, pass either `-d` or `--develop-notebook-templates` to `start_docker.sh`:
$ ./start_docker.sh -d
Now notebooks can be `git diff`ed and `git commit`ed without having to copy changed notebooks between `./scripts` and `./templates`.
Building Docker Images Locally¶
It may be useful to manually build a new Docker image as features are added, changes are made, and libraries are updated. In particular, updating Python libraries requires building a new Docker image from scratch.
Once you are in the `ark-analysis` directory, the Docker image can be built with the following command.
docker build -t ark-analysis .
The Docker image will now build; this process can take some time.
More on xarrays¶
One type of N-D array we use frequently is `xarray` (documentation). The main advantages `xarray` offers are:
Labeled dimension names
Flexible indexing types
While these can be achieved in `numpy` to a certain extent, doing so is much less intuitive. In contrast, `xarray` makes both very easy.
Just as `numpy`'s base array is `ndarray`, `xarray`'s base array is `DataArray`. We can initialize one from a `numpy` array as follows (`xarray` should always be imported as `xr`):
import numpy as np
import xarray as xr

arr = xr.DataArray(np.zeros((1024, 1024, 3)),
                   dims=['x', 'y', 'channel'],
                   coords=[np.arange(1024), np.arange(1024), ['red', 'green', 'blue']])
In this example, we assign the names `x`, `y`, and `channel` to the 0th, 1st, and 2nd dimensions respectively. Both `x` and `y` are indexed with 0-1023, whereas `channel` is indexed with RGB color names.
Indexing for `xarray` works like `numpy`. For example, to extract an `xarray` with `x=10:15`, `y=10:15`, and `channel=['red', 'blue']`:
arr.loc[10:15, 10:15, ['red', 'blue']]
This can also be extracted into a `numpy` array using `.values`:
arr.loc[10:15, 10:15, ['red', 'blue']].values
Note the use of `.loc` in both cases. You do not have to use `.loc` to index, but without it you are restricted to integer indexes. The following is equivalent to the above:
arr[10:15, 10:15, [0, 2]].values
In most cases, we recommend using `.loc` to get the full benefit of `xarray`. Note that `.loc` can also be used to assign values:
arr.loc[10:15, 10:15, ['red', 'blue']] = 255
To access the dimension names, use `arr.dims`, and to access the indices of a specific coordinate, use `arr.coord_name.values`.
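For example, with `arr` defined as above:
print(arr.dims)            # ('x', 'y', 'channel')
print(arr.channel.values)  # ['red' 'green' 'blue']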
Finally, to save an `xarray` to a file, use:
arr.to_netcdf(path, format="NETCDF3_64BIT")
You can load the `xarray` back in using:
arr = xr.load_dataarray(path)
Working with AnnData¶
We can load a single `AnnData` object using the function `anndata.read_zarr`, and several `AnnData` objects using the function `load_anndatas` from `ark.utils.data_utils`.
from anndata import read_zarr
from ark.utils.data_utils import load_anndatas
fov0 = read_zarr("data/example_dataset/fov0.zarr")
The channel intensities for each observation in the `AnnData` object can be accessed with the `.to_df()` method, and the channel names with `.var_names`.
fov0.var_names
fov0.to_df()
The observations and their properties can be accessed with the `obs` attribute of the `AnnData` object. The data here consists of measurements such as `area` and `perimeter`, and categorical information like `cell_meta_cluster`, for each cell.
fov0.obs
The $x$ and $y$ centroids of each cell can be accessed with the `obsm` attribute and the key `"spatial"`.
fov0.obsm["spatial"]
We can load all the `AnnData` objects in a directory lazily with `load_anndatas`, which returns a view over the `AnnData` objects in the directory.
fovs_ac = load_anndatas(anndata_dir="data/example_dataset")
We can utilize `AnnData` objects or `AnnCollection`s in a similar way to a Pandas DataFrame. For example, we can filter the `AnnCollection` to only include cells that have a `cell_meta_cluster` label of `"CD4T"`.
fovs_ac_cd4t = fovs_ac[fovs_ac.obs["cell_meta_cluster"] == "CD4T"]
print(type(fovs_ac_cd4t))
fovs_ac_cd4t.obs.df
The type of `fovs_ac_cd4t` is not an `AnnData` object, but instead an `AnnCollectionView`. This is a view of the subset of the `AnnCollection`, and it can only access `.obs`, `.obsm`, `.layers` and `.X`.
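For instance, the view still exposes the spatial centroids and the channel intensity matrix of the filtered cells (a short illustration using the fields shown earlier, assuming they load as expected):
fovs_ac_cd4t.obsm["spatial"]  # centroids of the CD4T cells
fovs_ac_cd4t.X                # channel intensities of the CD4T cells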
We can subset an `AnnCollectionView` to only include the first $n$ observations with the following code. The slice-based indexing behaves like a `numpy` array.
n = 100
fovs_ac_cd4t_100 = fovs_ac_cd4t[:n]
fovs_ac_cd4t_100.obs.df
Often we will want to subset the `AnnCollection` to only include observations contained within a specific FOV.
fov1_adata = fovs_ac[fovs_ac.obs["fov"] == "fov1"]
fov1_adata.obs.df
We can loop over all FOVs in an `AnnCollection` with the following code (an alternative method is shown in the DataLoaders section below):
all_fovs = fovs_ac.obs["fov"].unique()
for fov in all_fovs:
    fov_adata = fovs_ac[fovs_ac.obs["fov"] == fov]
    # do something with fov_adata
Functions which take in `AnnData` objects can often be applied to `AnnCollection`s.
The following works as expected:
import numpy as np

def dist(adata):
    x = adata.obsm["spatial"]["centroid_x"]
    y = adata.obsm["spatial"]["centroid_y"]
    return np.sqrt(x**2 + y**2)

dist(fovs_ac)
While the example below does not:
from squidpy import gr
gr.spatial_neighbors(adata=fovs_ac, spatial_key="spatial")
This is because an `AnnCollection` object does not have a `uns` property.
Utilizing DataLoaders¶
While an `AnnCollection` can sometimes be used to apply a function over all FOVs, in some instances we either cannot do that, or we want to apply a function to each FOV independently.
We can access the underlying `AnnData` objects with `.adatas`.
fovs_ac.adatas
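This also offers a simple workaround for functions that cannot be applied to an `AnnCollection` directly, such as `squidpy.gr.spatial_neighbors` above: apply them to each underlying `AnnData` instead (a minimal sketch; the `radius` and `coord_type` arguments mirror the example at the end of this section):
# iterate over the underlying AnnData objects, one per FOV
for adata in fovs_ac.adatas:
    gr.spatial_neighbors(adata=adata, radius=350, spatial_key="spatial", coord_type="generic")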
In these instances we can construct data pipelines with torchdata (https://pytorch.org/data/beta/index.html).
As an example, let's create a multi-stage `DataLoader` which does the following:
Only extracts the observations with an area greater than 300.
Only extracts the observations which have a `cell_meta_cluster` label of `"CD4T"`.
Computes the Spatial Neighbors graph for those observations (using `squidpy.gr.spatial_neighbors`: https://squidpy.readthedocs.io/en/stable/api/squidpy.gr.spatial_neighbors.html).
In order to construct a torchdata `DataLoader2` (https://pytorch.org/data/beta/dataloader2.html) iterator, we first need to create a torchdata `IterDataPipe` (https://pytorch.org/data/beta/torchdata.datapipes.iter.html). This implements the `__iter__()` protocol, and represents an iterable over data samples.
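As a minimal illustration of the protocol, unrelated to `ark` itself, torchdata's built-in `IterableWrapper` turns any iterable into an `IterDataPipe` whose transformations can be chained:
from torchdata.datapipes.iter import IterableWrapper

# wrap a plain iterable into an IterDataPipe, then chain a transformation
pipe = IterableWrapper([1, 2, 3]).map(lambda x: x * 2)
print(list(pipe))  # [2, 4, 6]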
We can convert the `AnnCollection` to a torchdata `IterDataPipe` with `ark.utils.data_utils.AnnDataIterDataPipe`.
from ark.utils.data_utils import AnnDataIterDataPipe
fovs_ip = AnnDataIterDataPipe(fovs=fovs_ac)
The following two functions are used to filter the observations in the `AnnData` objects to only include cells with an area greater than `min_area`, and cells with a `cell_meta_cluster` label of `cluster_label`.
from anndata import AnnData

def filter_cells_by_cluster(adata: AnnData, cluster_label: str) -> AnnData:
    return adata[adata.obs["cell_meta_cluster"] == cluster_label]

def filter_cells_by_area(adata: AnnData, min_area: int) -> AnnData:
    return adata[adata.obs["area"] > min_area]
The following function is used to filter out AnnData
objects which have no observations.
def filter_empty_adata(fov: AnnData) -> bool:
    return len(fov) > 0
We can apply these functions to the `IterDataPipe` with the `map` and `filter` methods. Because those methods return a new `IterDataPipe` object, we can chain them together.
from functools import partial
cd4t_obs_filter = partial(filter_cells_by_cluster, cluster_label="CD4T")
area_obs_filter = partial(filter_cells_by_area, min_area=300)
fovs_subset = fovs_ip.map(cd4t_obs_filter).map(area_obs_filter).filter(filter_empty_adata)
The data pipeline can be visualized with the `to_graph` function.
from torchdata.datapipes.utils import to_graph
to_graph(fovs_subset)
The `DataLoader2` can now be constructed.
from torchdata.dataloader2 import DataLoader2
fovs_subset_dl = DataLoader2(fovs_subset)
We can now loop over the `DataLoader2` and compute the Spatial Neighbors graph per FOV with the filtered observations.
for fov in fovs_subset_dl:
    gr.spatial_neighbors(adata=fov, radius=350, spatial_key="spatial", coord_type="generic")