CAD-score-LT version 0.9

About CAD-score-LT

CAD-score-LT software computes CAD-score (Contact Area Difference score): a superposition-free similarity measure based on contact areas. CAD-score-LT is based on the Voronota-LT software. CAD-score-LT is much faster and more versatile than the previous CAD-score implementations. It is designed to work for both global comparisons and for localized analyses such as interfaces and binding sites.

CAD-score-LT is open source (MIT license) and hosted on GitHub as a project in the Voronota monorepo.

CAD-score definition for residue-residue contacts

Given reference structure T (target) and structure to be compared, M (model), let G denote the set of all the pairs of residues (i,j) that have a non-zero contact area T(i,j) in the target structure. Then for every residue pair (i,j) ∈ G the corresponding contact area M(i,j) in the model is calculated. M(i,j) is assigned zero if there is no contact between residues i and j in the model or if either residue (i or j) is missing from the model. The CAD-score for the model structure is then defined as:

$$\text{CAD-score}(G)=1-\frac {\sum_{(i,j) \in G} \min(|T_{(i,j)}-M_{(i,j)}|,T_{(i,j)})} {\sum_{(i,j) \in G}T_{(i,j)} }$$

CAD-score values are always within the [0,1] range. If model and target structures are identical, CAD-score(G) = 1. At the other extreme, if not a single contact is reproduced with sufficient accuracy, CAD-score(G) = 0.

More local scores (e.g. interface-scores or per-residue scores) are calculated by applying the CAD-score formula to a restricted subset of G.
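The formula above can be sketched in a few lines of Python. This is only an illustration of the definition, not the CAD-score-LT implementation: the function name `cad_score`, the dictionary layout, and the residue labels are assumptions made for the example.

```python
def cad_score(target_areas, model_areas, subset=None):
    """CAD-score from two dicts that map an identifier to an area.

    For residue-residue contacts the identifier is a pair (i, j).
    `subset`, if given, restricts G (e.g. to interface contacts).
    """
    # G: identifiers that have a non-zero area in the target
    g = [key for key, area in target_areas.items() if area > 0.0]
    if subset is not None:
        g = [key for key in g if key in subset]
    total = sum(target_areas[key] for key in g)
    if total == 0.0:
        raise ValueError("no target area to score")
    # areas missing from the model count as zero;
    # every per-contact error is capped at the target area
    error = sum(
        min(abs(target_areas[key] - model_areas.get(key, 0.0)), target_areas[key])
        for key in g
    )
    return 1.0 - error / total


# hypothetical residue-residue contact areas
target = {("A1", "A2"): 10.0, ("A1", "B7"): 20.0, ("A2", "B8"): 5.0}
model = {("A1", "A2"): 8.0, ("A1", "B7"): 50.0}

print(cad_score(target, model))                                       # global score
print(cad_score(target, model, subset={("A1", "B7"), ("A2", "B8")}))  # local score
```

Capping each error at the target area is what keeps the score inside the [0,1] range even when a model contact is much larger than the target contact.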

CAD-score definition for other types of areas

If areas in a target and in a model have identifiers, and those identifiers can be matched, then CAD-score can be calculated. For example, if we want to compare solvent-accessible (SAS) areas, we can assign identifiers to such areas. Let us formulate the CAD-score definition for per-residue SAS areas in a manner similar to the CAD-score definition for residue-residue contacts.

Given reference structure T (target) and structure to be compared, M (model), let G denote the set of all the residues i that have a non-zero SAS area T(i) in the target structure. Then for every residue i ∈ G the corresponding SAS area M(i) in the model is calculated. M(i) is assigned zero if there is no SAS for residue i in the model or if residue i is missing from the model. The SAS-based CAD-score for the model structure is then defined as:

$$\text{CAD-score}(G)=1-\frac {\sum_{i \in G} \min(|T_{(i)}-M_{(i)}|,T_{(i)})} {\sum_{i \in G}T_{(i)} }$$

Similarly we can calculate CAD-score for per-residue binding site areas produced by accumulating contact areas.
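As a rough sketch of that idea, per-residue binding site areas can be obtained by summing the contact areas each residue takes part in, and then scored with the single-index formula. The accumulation rule used here (every contact counts fully for both partners) and all names are assumptions for illustration, not a description of the CAD-score-LT internals.

```python
from collections import defaultdict


def site_areas(contact_areas):
    """Accumulate contact areas per residue (assumed rule: both partners get the full area)."""
    sites = defaultdict(float)
    for (res_a, res_b), area in contact_areas.items():
        sites[res_a] += area
        sites[res_b] += area
    return dict(sites)


def cad_score_single_index(target_areas, model_areas):
    """CAD-score formula for areas identified by a single index i."""
    g = [i for i, area in target_areas.items() if area > 0.0]
    total = sum(target_areas[i] for i in g)
    error = sum(
        min(abs(target_areas[i] - model_areas.get(i, 0.0)), target_areas[i]) for i in g
    )
    return 1.0 - error / total


# hypothetical inter-chain contacts
target_contacts = {("A1", "B7"): 20.0, ("A2", "B7"): 5.0}
model_contacts = {("A1", "B7"): 15.0}

target_sites = site_areas(target_contacts)
model_sites = site_areas(model_contacts)
print(cad_score_single_index(target_sites, model_sites))
```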

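Overall, in CAD-score-LT, three types of areas can be analyzed: contact areas, solvent-accessible surface (SAS) areas, and binding site areas.

Areas of every type can be assessed on three levels of detail: atom, residue, and chain.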

All the areas are initially computed on the atom level. Then, if needed, the atom-level areas are aggregated on the residue and/or chain levels.
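The aggregation step can be pictured with the following sketch, which sums atom-atom contact areas into residue-residue or chain-chain areas. The atom identifier layout `(chain, residue_number, atom_name)` and the choice to skip contacts that fall inside a single residue or chain are assumptions made for this example.

```python
from collections import defaultdict


def aggregate_contacts(atom_contacts, level):
    """Sum atom-atom contact areas on the 'residue' or 'chain' level."""
    keep = {"residue": 2, "chain": 1}[level]  # how many id fields define the unit
    aggregated = defaultdict(float)
    for (atom_a, atom_b), area in atom_contacts.items():
        unit_a, unit_b = atom_a[:keep], atom_b[:keep]
        if unit_a == unit_b:
            continue  # assumed: contacts inside one unit are not reported at this level
        aggregated[tuple(sorted((unit_a, unit_b)))] += area
    return dict(aggregated)


# hypothetical atom-level contact areas
atom_contacts = {
    (("A", 1, "CB"), ("A", 2, "N")): 3.0,
    (("A", 1, "CA"), ("A", 2, "N")): 1.5,
    (("A", 1, "CB"), ("B", 7, "O")): 2.0,
}

print(aggregate_contacts(atom_contacts, "residue"))
print(aggregate_contacts(atom_contacts, "chain"))
```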

Benchmarking data and results

Benchmarking data and results are available here.

Quickest install guide

Since CAD-score-LT version 0.8.106, universal binary executables of CAD-score-LT built with the Cosmopolitan Libc toolkit are provided.

To download and prepare the latest released cosmopolitan executable, run the following commands in a shell environment (e.g. a Bash shell):

wget 'https://github.com/kliment-olechnovic/voronota/releases/download/v1.29.4771/cosmopolitan_cadscore-lt_v0.9.167.exe'
mv cosmopolitan_cadscore-lt_v0.9.167.exe cadscore-lt
chmod +x cadscore-lt

In a PowerShell environment on Windows, the setup can be done with a single command:

Invoke-WebRequest -Uri 'https://github.com/kliment-olechnovic/voronota/releases/download/v1.29.4771/cosmopolitan_cadscore-lt_v0.9.167.exe' -OutFile cadscore-lt.exe

Building the command-line tool from source code

Download the latest CAD-score-LT source archive from the official downloads page at https://github.com/kliment-olechnovic/voronota/releases, e.g. cadscore-lt_v0.9.167.tar.gz.

Requirements

The CAD-score-LT executable can be built from the provided source code to work on any modern Linux, macOS or Windows operating systems.

CAD-score-LT has no required external library dependencies; only a C++17-compliant compiler is needed to build it.

Using CMake

You can build using CMake for makefile generation.

Change to the CAD-score-LT directory with CMakeLists.txt file, then run the sequence of commands:

cmake ./
make

Alternatively, to keep files more organized, CMake can be run in a separate “build” directory:

mkdir build
cd build
cmake ../
make
cp ./cadscore-lt ../cadscore-lt

Using C++ compiler directly

For example, the “cadscore-lt” executable can be built using the GNU C++ compiler.

Change to the CAD-score-LT directory and run the compilation command:

g++ -std=c++17 -O3 -fopenmp -I ./src -o ./cadscore-lt ./src/cadscore_lt.cpp

Performance-boosting compiler flags can be included:

g++ -std=c++17 -Ofast -march=native -fopenmp -I ./src -o ./cadscore-lt ./src/cadscore_lt.cpp

Basic usage

Listing command-line options

Running

cadscore-lt -h

prints an overview of the command line interface:

CAD-score-LT version 0.9

'cadscore-lt' calculates CAD-score (Contact Area Difference score).

Options:
    --targets | -t                                   string     input file or directory paths for target (reference) structure files
    --models | -m                                    string     input file or directory paths for model structure files
    --processors                                     number     maximum number of OpenMP threads to use, default is 2 if OpenMP is enabled, 1 if disabled
    --recursive-directory-search                                flag to search directories recursively
    --include-heteroatoms                                       flag to include heteroatoms when reading input in PDB or mmCIF format
    --read-inputs-as-assemblies                                 flag to join multiple models into an assembly when reading a file in PDB or mmCIF format
    --radii-config-file                              string     input file path for reading atom radii assignment rules
    --probe                                          number     rolling probe radius, default is 1.4
    --restrict-raw-input                             string     selection expression to restrict input atoms before any chain renaming or residue renumbering
    --reference-sequences-file                       string     input file path for reference sequences in FASTA format
    --reference-stoichiometry                        numbers    list of stoichiometry values to apply when mapping chains to reference sequences
    --restrict-processed-input                       string     selection expression to restrict input atoms after all chain renaming and residue renumbering
    --save-processed-inputs-mmcif                               flag to save processed input structures in mmCIF format to the output directory
    --save-processed-inputs-pdb                                 flag to save processed input structures in PDB format to the output directory
    --save-sequence-alignments                                  flag to save best alignments with reference sequences into a file in the output directory
    --quit-before-scoring                                       flag to exit before scoring but after all the input processing and saving
    --subselect-contacts                             string     selection expression to restrict contact area descriptors to score, default is '[-min-sep 1]'
    --subselect-atoms                                string     selection expression to restrict atom SAS and site area descriptors to score, default is '[]'
    --conflate-atom-types                                       flag to conflate known equivalent atom types
    --conflation-config-file                         string     input file path for reading atom name conflation rules
    --scoring-types                                  strings    scoring types ('contacts', 'SAS', 'sites'), default is 'contacts'
    --scoring-levels                                 strings    scoring levels ('atom', 'residue', 'chain'), default is 'residue'
    --local-output-formats                           strings    list of formats (can include 'table', 'pdb', 'mmcif', 'contactmap', 'graphics-pymol', 'graphics-chimera')
    --local-output-levels                            strings    list of output levels (can include 'atom', 'residue', 'chain'), default is 'residue'
    --consider-residue-names                                    flag to include residue names in residue and atom identifiers, making mapping more strict
    --binarize-areas                                            flag to binarize (convert to 0 or 1) all area values before scoring
    --remap-chains                                              flag to automatically rename chains in models to maximize residue-residue contacts global score
    --max-chains-to-fully-permute                    number     limit of chain combinations to chech exhaustively when remapping chains, default is 200
    --clustering-thresholds                          numbers    clustering thresholds for Taylor-Butina-like clustering if in all-to-all comparison mode
    --max-renaming-cache-size                        number     max number of contact sets to cache when doing comparisons to multiple targets, default is 400
    --print-paths-in-output                                     flag to print file paths instead of file base names in output
    --output-with-f1                                            flag to output area-based F1 scores along with CAD-scores
    --output-with-areas                                         flag to output all recorded types of areas in tables of global and local scores
    --output-with-identities                                    flag to output identity percentages (for input atoms, residues, chains) along with CAD-scores
    --compact-output                                            flag to reduce size of output global scores table without removing rows
    --extremely-compact-output                                  flag to reduce size of output global scores by writing them as an integer matrix
    --output-global-scores                           string     path to output table of global scores, default is '_stdout' to print to standard output 
    --output-dir                                     string     path to output directory for all result files
    --help | -h                                                 flag to print help info to stderr and exit

Standard output stream:
    Global scores

Standard error output stream:
    Error messages

Usage examples:

    cadscore-lt -t ./target.pdb -m ./model1.pdb ./model2.pdb

    cadscore-lt -t ./target.pdb -m ./model1.pdb ./model2.pdb --subselect-contacts '[-inter-chain]'

Choosing scoring types and scoring levels

CAD-score-LT supports several scoring types and scoring levels.

Scoring types (--scoring-types): 'contacts', 'SAS', 'sites'.

Scoring levels (--scoring-levels): 'atom', 'residue', 'chain'.

Residue-level contacts is the default mode. That is, running

cadscore-lt \
  -t ./target.pdb \
  -m ./model1.pdb

is equivalent to running:

cadscore-lt \
  -t ./target.pdb \
  -m ./model1.pdb \
  --scoring-types contacts \
  --scoring-levels residue

Multiple options (even all of them) can be used together, for example:

cadscore-lt \
  -t ./target.pdb \
  -m ./model1.pdb \
  --scoring-types contacts sas sites \
  --scoring-levels atom residue chain

Providing inputs

For input, CAD-score-LT accepts files in PDB or mmCIF formats.

Basic usage (one target, one model):

cadscore-lt -t target.pdb -m model1.pdb

One target, multiple models:

cadscore-lt -t target.pdb -m model1.pdb model2.pdb

Read models from stdin (convenient for find):

find ./input/ -name 'model*.pdb' | cadscore-lt -t ./input/target.pdb

All-vs-all mode is entered when no targets are provided (no --targets or -t given). CAD-score-LT then sets targets = models and compares all pairs (excluding self-pairs). For example:

find ./input/ -name '*.pdb' | cadscore-lt

All-vs-all mode is convenient for clustering structures or their parts (e.g. inter-chain interfaces).
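To get a feeling for the amount of work in all-vs-all mode, note that the CAD-score formula normalizes by target areas, so (target, model) and (model, target) are different comparisons. The short sketch below enumerates the ordered pairs for a hypothetical list of six files; the file names are placeholders.

```python
from itertools import permutations

# hypothetical input files
models = [f"model{number}.pdb" for number in range(1, 7)]

# ordered (target, model) pairs, self-pairs excluded
pairs = list(permutations(models, 2))

print(len(pairs))   # N * (N - 1) comparisons for N input structures
print(pairs[:3])
```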

Outputting a global scores table

By default, cadscore-lt prints a tab-separated table of global scores to stdout.

Useful flags:

Useful tip: pipe the output through the column -t command to align the column values for easier reading.

Examples:

cadscore-lt -t target.pdb -m model1.pdb model2.pdb --output-with-f1 | column -t
cadscore-lt -t target.pdb -m model1.pdb model2.pdb --output-all-details | column -t

Selecting what to score

CAD-score-LT lets you:

Both atom-focused and contact-focused expressions should be written in the selection language described in the next section.

Filtering (selection) system

Some optional arguments of CAD-score-LT expect selection expressions:

--restrict-raw-input                             string     selection expression to restrict input atoms before any chain renaming or residue renumbering
--restrict-processed-input                       string     selection expression to restrict input atoms after all chain renaming and residue renumbering
--subselect-contacts                             string     selection expression to restrict contact area descriptors to score, default is '[-min-sep 1]'
--subselect-atoms                                string     selection expression to restrict atom SAS and site area descriptors to score, default is '[]'

The expressions need to be specified using the filtering (selection) language described in this section. The language is used to select atoms or atom-based descriptors (like cells and sites) and contacts.

The language supports explicit boolean logic with round brackets to avoid any ambiguity in operator precedence. It is recommended to always use round brackets when combining filters with logical operators. Even when redundant, over-bracketing makes expressions easier to read and safer for users unfamiliar with precedence rules.

Filter blocks

A filter block is written in square brackets:

[ <clause> <clause> ... ]

Boolean expressions

Filter blocks can be combined into boolean expressions.

Supported operators:

Logical meaning    Accepted forms
AND                and, &, &&
OR                 or, |, ||
NOT                not, !

Round brackets ( ) may be nested freely and are encouraged.

Examples

[ ... ]
( [ ... ] and [ ... ] )
( ( [ ... ] ) or ( not [ ... ] ) )
( ( ( [ ... ] and [ ... ] ) or ( [ ... ] ) ) and ( not [ ... ] ) )

The same boolean expression syntax applies to atom selection and contact selection.
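When expressions are assembled programmatically (for example, before passing them to the command line or to the Python bindings), a few small string helpers make consistent over-bracketing easy. The helper names below are hypothetical and only build text in the syntax shown above; they do not validate clauses.

```python
def block(*clauses):
    """Wrap clauses into a square-bracket filter block."""
    return "[" + " ".join(clauses) + "]"


def all_of(*expressions):
    """Combine expressions with 'and', bracketing every operand."""
    return "(" + " and ".join(f"({e})" for e in expressions) + ")"


def any_of(*expressions):
    """Combine expressions with 'or', bracketing every operand."""
    return "(" + " or ".join(f"({e})" for e in expressions) + ")"


def negate(expression):
    """Negate an expression with 'not'."""
    return f"(not {expression})"


expression = all_of(
    block("-protein-sidechain", "-rnum 50:100"),
    negate(block("-rname PRO")),
)
print(expression)
```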

Atom (or atom-based descriptor) selection

Atom filters match atoms based on chain, residue, atom name, element, and biochemical category.

Chain selection

-chain <list>        or  -c <list>
-chain-not <list>    or  -c! <list>

Examples:

[ -chain A ]
[ -c A,B ]
[ -c! C ]

Residue number (ranges use ‘:’)

-residue-number <intervals>        or  -rnum <intervals>
-residue-number-not <intervals>    or  -rnum! <intervals>

Examples:

[ -rnum 42 ]
[ -rnum 10:20 ]
[ -rnum! 1:5 ]
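The interval notation can be mimicked in Python as follows. This sketch assumes that both ends of a range are inclusive and that several intervals can be joined with commas (as in the 39:66,75:87 example used later in this document); it is not taken from the CAD-score-LT source code.

```python
def parse_intervals(text):
    """Turn '42' or '39:66,75:87' into a list of (first, last) pairs."""
    intervals = []
    for part in text.split(","):
        first, _, last = part.partition(":")
        intervals.append((int(first), int(last or first)))
    return intervals


def residue_number_matches(residue_number, text):
    """True if the residue number falls into any of the intervals (ends assumed inclusive)."""
    return any(first <= residue_number <= last for first, last in parse_intervals(text))


print(parse_intervals("39:66,75:87"))
print(residue_number_matches(70, "39:66,75:87"))
```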

Residue name

-residue-name <list>        or  -rname <list>
-residue-name-not <list>    or  -rname! <list>

Examples:

[ -rname ALA ]
[ -rname ALA,GLY,SER ]
[ -rname! PRO ]

Atom name

-atom-name <list>        or  -aname <list>
-atom-name-not <list>    or  -aname! <list>

Examples:

[ -aname CA ]
[ -aname CA,CB ]
[ -aname! H ]

Element

-element <list>        or  -elem <list>
-element-not <list>    or  -elem! <list>

Examples:

[ -elem C ]
[ -elem N,O ]
[ -elem! H ]

Residue ID (combined identifier)

-residue-id <list>        or  -rid <list>
-residue-id-not <list>   or  -rid! <list>

Supported formats:

Examples:

[ -rid 42|ALA ]
[ -rid 101/B ]

Protein-specific macros

These predefined macros expand internally to standard residue and atom sets.

Macro                 Description
-protein              All standard protein residues
-protein-backbone     Backbone atoms (N, CA, C, O)
-protein-sidechain    Side-chain atoms

Examples:

[ -protein ]
[ -protein-backbone ]
[ -protein-sidechain ]

Nucleic-acid–specific macros

These macros select standard nucleic-acid residues and optionally restrict atoms to backbone or base (side-chain) atoms.

Macro                     Description
-nucleic                  All nucleic acids (DNA and RNA)
-nucleic-dna              DNA residues only
-nucleic-rna              RNA residues only
-nucleic-backbone         Sugar-phosphate backbone atoms (DNA or RNA)
-nucleic-sidechain        Base atoms (DNA or RNA)
-nucleic-dna-backbone     DNA backbone atoms
-nucleic-dna-sidechain    DNA base atoms
-nucleic-rna-backbone     RNA backbone atoms
-nucleic-rna-sidechain    RNA base atoms

Conceptually:

Atom-selection examples

ALA CB atom:

[ -rname ALA -aname CB ]

C-alpha atoms in chain A:

[ -chain A -aname CA ]

Protein side-chain atoms in residues 50–100, excluding PRO:

( [ -protein-sidechain -rnum 50:100 ] ) and ( not [ -rname PRO ] )

Contact selection

Contact filters match a pair of atom groups (atom1, atom2) and optional relationship constraints.

Selecting atom1 and atom2

Positive forms:

-atom1 <atom-filter>   or  -a1 <atom-filter>
-atom2 <atom-filter>   or  -a2 <atom-filter>

Negation forms:

-atom1-not <atom-filter>   or  -a1! <atom-filter>
-atom2-not <atom-filter>   or  -a2! <atom-filter>

Example:

[ -a1 [ -rname ALA -aname CB ] -a2 [ -protein-backbone ] ]

Chain and residue relationships

Clause            Meaning
-inter-chain      Atoms from different chains
-intra-chain      Atoms from the same chain
-inter-residue    Atoms from different residues

Examples:

[ -inter-chain ]
[ -inter-residue ]

Sequence separation (same chain)

-min-sep <int>
-max-sep <int>

Examples:

[ -min-sep 5 ]
[ -max-sep 10 ]

Distance constraint

-max-dist <float>

Example:

[ -max-dist 3.5 ]

Contact-selection examples

ALA CB with protein backbone:

[ -a1 [ -rname ALA -aname CB ] -a2 [ -protein-backbone ] ]

Inter-chain protein–protein contacts:

[ -a1 [ -protein ] -a2 [ -protein ] -inter-chain ]

Short side-chain contacts:

( [ -a1 [ -protein-sidechain ] -a2 [ -protein-sidechain ] -max-dist 3.5 ] )

Complex example:

(
  (
    ( [ -a1 [ -aname CA ] -a2 [ -protein-sidechain ] ] )
    and
    ( [ -inter-residue ] )
  )
  or
  ( [ -max-dist 3.0 ] )
)

Examples of using selection expressions in CAD-score-LT

Restricting input atoms

Example of restricting atoms before any chain renaming and residue renumbering:

cadscore-lt -t target.pdb -m model1.pdb model2.pdb \
  --output-all-details \
  --restrict-raw-input "[-chain A]" \
| column -t

Example of restricting atoms after any chain renaming and residue renumbering:

cadscore-lt -t target.pdb -m model1.pdb model2.pdb \
  --output-all-details \
  --reference-sequences-file ./sequences.fasta \
  --reference-stoichiometry 2 2 1 \
  --restrict-processed-input "[-chain A,B]" \
| column -t

Selecting contacts

Using --subselect-contacts is the most common way to focus CAD-score.

Default contact selection is [-min-sep 1] (to discard contacts between atoms in the same residue).

Example of sidechain–sidechain contact scoring:

cadscore-lt -t target.pdb -m model1.pdb model2.pdb \
  --output-all-details \
  --subselect-contacts "[-min-sep 1 -atom1 [-protein-sidechain] -atom2 [-protein-sidechain]]" \
| column -t

Example of scoring contacts between chain A and chain B:

cadscore-lt -t target.pdb -m model1.pdb model2.pdb \
  --output-all-details \
  --subselect-contacts "[-a1 [-chain A] -a2 [-chain B]]" \
| column -t

Example of scoring contacts between any different chains:

cadscore-lt -t target.pdb -m model1.pdb model2.pdb \
  --output-all-details \
  --subselect-contacts "[-inter-chain]" \
| column -t

Example of scoring contacts between residue sets (example with ranges):

cadscore-lt -t target.pdb -m model1.pdb model2.pdb \
  --output-all-details \
  --subselect-contacts "[-a1 [-chain B -rnum 39:51] -a2 [-chain B -rnum 39:66,75:87]]" \
| column -t

Selections can be combined with boolean expressions, as shown below:

cadscore-lt -t target.pdb -m model1.pdb model2.pdb \
  --output-all-details \
  --subselect-contacts "(([-a1 [-chain B -rnum 39:51] -a2 [-chain B -rnum 39:66]]) or ([-a1 [-chain B -rnum 39:51] -a2 [-chain B -rnum 75:87]]))" \
| column -t

Select atoms for SAS/sites

When scoring SAS or binding sites, you can select which atoms are considered using the --subselect-atoms option:

cadscore-lt -t target.pdb -m model1.pdb model2.pdb \
  --subselect-contacts "[-inter-chain]" \
  --scoring-types "sites" \
  --subselect-atoms "[-chain A]" \
  --output-all-details \
| column -t

Local scoring outputs

CAD-score-LT can write local scores and visualization helpers into an output directory.

Options to use:

Formats (--local-output-formats) can include: 'table', 'pdb', 'mmcif', 'contactmap', 'graphics-pymol', 'graphics-chimera'.

Example of writing local per-residue scores as tables and as PDB B-factors:

cadscore-lt -t target.pdb -m model1.pdb model2.pdb \
  --output-dir ./results \
  --local-output-formats table pdb \
| column -t

Example of writing interface local scores:

cadscore-lt -t target.pdb -m model1.pdb model2.pdb \
  --subselect-contacts "[-inter-chain]" \
  --output-dir ./results_inter_chain \
  --local-output-formats table mmcif contactmap graphics-pymol \
| column -t

PDB/mmCIF outputs are typically easiest to inspect in molecular viewers by coloring residues by B-factor. CAD-score-LT also sets occupancy values: 1 for every atom that contributed to any area involved in scoring, 0 for all other atoms. For example, when scoring an inter-chain interface, only atoms involved in inter-chain contacts will have an occupancy value of 1.
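For scripted inspection, the B-factor and occupancy fields can be read directly from the fixed-width PDB ATOM records (occupancy in columns 55-60, B-factor in columns 61-66 of the PDB format). The sketch below assumes that the B-factor of a scored atom carries the local score of its residue; the record values are made up, and the records are generated in code only to keep the example self-contained.

```python
def format_atom_line(serial, name, resname, chain, resseq, x, y, z, occupancy, bfactor):
    """Build a fixed-width PDB ATOM record (columns 1-66) for this demonstration."""
    return (f"ATOM  {serial:5d} {name:<4s} {resname:>3s} {chain:1s}{resseq:4d}    "
            f"{x:8.3f}{y:8.3f}{z:8.3f}{occupancy:6.2f}{bfactor:6.2f}")


def read_local_scores(pdb_lines):
    """Collect B-factor values of residues that have atoms with non-zero occupancy."""
    scores = {}
    for line in pdb_lines:
        if line.startswith(("ATOM", "HETATM")):
            residue = (line[21], int(line[22:26]))  # chain ID, residue number
            occupancy = float(line[54:60])
            bfactor = float(line[60:66])
            if occupancy > 0.0:  # atom was involved in scoring
                scores[residue] = bfactor
    return scores


# hypothetical records: residue 1 was scored, residue 2 was not
lines = [
    format_atom_line(1, "N", "ALA", "A", 1, 11.104, 6.134, -6.504, 1.0, 0.62),
    format_atom_line(2, "CA", "GLY", "A", 2, 12.560, 6.010, -6.320, 0.0, 0.0),
]
print(read_local_scores(lines))
```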

Chain remapping for complexes

When scoring multimeric complexes, model chain IDs may not match target chain IDs (especially homomers).

CAD-score-LT can automatically rename chains in models to try to maximize the global residue–residue contacts score. For that, use the --remap-chains flag:

cadscore-lt -t target.pdb -m model1.pdb model2.pdb \
  --subselect-contacts "[-inter-chain]" \
  --remap-chains \
| column -t
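The idea behind chain remapping can be illustrated with a brute-force sketch: try every renaming of model chains and keep the one that gives the best score. The help text mentions a limit on the number of chain combinations that are checked exhaustively, so the real tool evidently does more than this for larger complexes. Everything below (function names, chain-level contact dictionaries, area values) is an assumption for illustration.

```python
from itertools import permutations


def cad_score(target_areas, model_areas):
    """CAD-score formula applied to dicts keyed by sorted chain pairs."""
    total = sum(target_areas.values())
    error = sum(min(abs(t - model_areas.get(k, 0.0)), t) for k, t in target_areas.items())
    return 1.0 - error / total


def rename_chains(contacts, mapping):
    """Apply a chain renaming to contact keys, keeping the pairs sorted."""
    return {tuple(sorted((mapping[a], mapping[b]))): area for (a, b), area in contacts.items()}


def best_chain_mapping(target, model, chains):
    """Exhaustively test all chain permutations and return (best score, mapping)."""
    best = None
    for permuted in permutations(chains):
        mapping = dict(zip(chains, permuted))
        score = cad_score(target, rename_chains(model, mapping))
        if best is None or score > best[0]:
            best = (score, mapping)
    return best


# hypothetical chain-chain contact areas; in the model, chains B and C are swapped
target = {("A", "B"): 100.0, ("B", "C"): 40.0}
model = {("A", "C"): 98.0, ("B", "C"): 42.0}

best_score, best_mapping = best_chain_mapping(target, model, ["A", "B", "C"])
print(best_score, best_mapping)
```

The number of permutations grows factorially with the number of chains, which is why an exhaustive search of this kind is only practical for small complexes.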

Clustering of models

When running in all-versus-all comparison mode, CAD-score-LT can cluster models with a Taylor-Butina-like algorithm - the same algorithm as used for clustering in PPI3D. Multiple CAD-score thresholds can be provided to obtain multiple clusterings in one run.

For example:

find ./input/protein_rna/ -type f -name '*model*' \
| cadscore-lt \
  --scoring-levels residue \
  --scoring-types contacts \
  --subselect-contacts '[-a1 [-protein] -a2 [-nucleic]]' \
  --output-dir ./output_dir_v1 \
  --clustering-thresholds 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0

This will produce both all-versus-all global scores and clustering info in the following files in the output directory:

An example of an input_model_files.tsv file:

0  6FPQ_model1a.pdb
1  6FPQ_model1b.pdb
2  6FPQ_model2a.pdb
3  6FPQ_model2b.pdb
4  6FPQ_model3a.pdb
5  6FPQ_model3b.pdb

An example of a global_scores.tsv file:

target            model             t_id  m_id  residue_residue_cadscore
6FPQ_model3a.pdb  6FPQ_model3b.pdb  4     5     0.878349
6FPQ_model3b.pdb  6FPQ_model3a.pdb  5     4     0.877782
6FPQ_model1a.pdb  6FPQ_model1b.pdb  0     1     0.854266
6FPQ_model1b.pdb  6FPQ_model1a.pdb  1     0     0.849624
6FPQ_model2b.pdb  6FPQ_model2a.pdb  3     2     0.84137
6FPQ_model2a.pdb  6FPQ_model2b.pdb  2     3     0.837263
6FPQ_model1a.pdb  6FPQ_model2a.pdb  0     2     0.377615
6FPQ_model1a.pdb  6FPQ_model2b.pdb  0     3     0.373931
6FPQ_model2a.pdb  6FPQ_model1a.pdb  2     0     0.368113
6FPQ_model1b.pdb  6FPQ_model2b.pdb  1     3     0.36464
6FPQ_model1b.pdb  6FPQ_model2a.pdb  1     2     0.35661
6FPQ_model2a.pdb  6FPQ_model1b.pdb  2     1     0.344747
6FPQ_model2b.pdb  6FPQ_model1a.pdb  3     0     0.344366
6FPQ_model2b.pdb  6FPQ_model1b.pdb  3     1     0.332512
6FPQ_model3a.pdb  6FPQ_model2a.pdb  4     2     0.201901
6FPQ_model2b.pdb  6FPQ_model3a.pdb  3     4     0.188873
6FPQ_model3a.pdb  6FPQ_model2b.pdb  4     3     0.188156
6FPQ_model3b.pdb  6FPQ_model2a.pdb  5     2     0.183383
6FPQ_model2a.pdb  6FPQ_model3a.pdb  2     4     0.183358
6FPQ_model3b.pdb  6FPQ_model2b.pdb  5     3     0.173589
6FPQ_model2b.pdb  6FPQ_model3b.pdb  3     5     0.16057
6FPQ_model2a.pdb  6FPQ_model3b.pdb  2     5     0.152587
6FPQ_model1a.pdb  6FPQ_model3a.pdb  0     4     0.13624
6FPQ_model1b.pdb  6FPQ_model3b.pdb  1     5     0.129939
6FPQ_model1b.pdb  6FPQ_model3a.pdb  1     4     0.129754
6FPQ_model1a.pdb  6FPQ_model3b.pdb  0     5     0.125677
6FPQ_model3a.pdb  6FPQ_model1a.pdb  4     0     0.089375
6FPQ_model3b.pdb  6FPQ_model1b.pdb  5     1     0.087488
6FPQ_model3b.pdb  6FPQ_model1a.pdb  5     0     0.084623
6FPQ_model3a.pdb  6FPQ_model1b.pdb  4     1     0.078454
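A table like this can be loaded with the Python standard library. The sketch assumes that the saved file is tab-separated with the header shown above (the listing here appears aligned for display); only four of the rows are reproduced as sample data.

```python
import csv
import io

# a few rows from the table above, written with tab separators
sample = (
    "target\tmodel\tt_id\tm_id\tresidue_residue_cadscore\n"
    "6FPQ_model3a.pdb\t6FPQ_model3b.pdb\t4\t5\t0.878349\n"
    "6FPQ_model3b.pdb\t6FPQ_model3a.pdb\t5\t4\t0.877782\n"
    "6FPQ_model1a.pdb\t6FPQ_model2a.pdb\t0\t2\t0.377615\n"
    "6FPQ_model2a.pdb\t6FPQ_model1a.pdb\t2\t0\t0.368113\n"
)


def load_global_scores(handle):
    """Return {(target, model): score} from a tab-separated global scores table."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {
        (row["target"], row["model"]): float(row["residue_residue_cadscore"])
        for row in reader
    }


scores = load_global_scores(io.StringIO(sample))

# the score depends on which structure is used as the target
forward = scores[("6FPQ_model1a.pdb", "6FPQ_model2a.pdb")]
backward = scores[("6FPQ_model2a.pdb", "6FPQ_model1a.pdb")]
print(forward, backward)
```

For a real file, `io.StringIO(sample)` would be replaced with an opened file handle.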

When dealing with large sets of models (e.g. thousands of models), global scores in the default verbose format may take a lot of space. In such cases it is recommended to use the --extremely-compact-output flag, for example:

find ./input/protein_rna/ -type f -name '*model*' \
| cadscore-lt \
  --scoring-levels residue \
  --scoring-types contacts \
  --subselect-contacts '[-a1 [-protein] -a2 [-nucleic]]' \
  --output-dir ./output_dir_v2 \
  --extremely-compact-output \
  --clustering-thresholds 0.3 0.4 0.5 0.6 0.7 0.8 0.9

This will output global 0-1 scores converted to integer 0-100 values and save them in the output_dir_v2/residue_residue_cadscore_global.tsv file. That file holds a matrix whose rows and columns correspond to the model ordering in the output_dir_v2/input_model_files.tsv file.

Overall, the “extremely compact” regime will produce the following files relevant to the clustering:

An example of an input_model_files.tsv file:

6FPQ_model1a.pdb
6FPQ_model1b.pdb
6FPQ_model2a.pdb
6FPQ_model2b.pdb
6FPQ_model3a.pdb
6FPQ_model3b.pdb

An example of a residue_residue_cadscore_global.tsv matrix file:

100  85   37   34   9    8
85   100  34   33   8    9
38   36   100  84   20   18
37   36   84   100  19   17
14   13   18   19   100  88
13   13   15   16   88   100
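The two files can be combined to look up a score by file name. The sketch below embeds the example data shown above and converts the integer values back to the 0-1 scale; which of the two indices corresponds to the target is not stated here, so the function simply speaks of rows and columns.

```python
names_text = """6FPQ_model1a.pdb
6FPQ_model1b.pdb
6FPQ_model2a.pdb
6FPQ_model2b.pdb
6FPQ_model3a.pdb
6FPQ_model3b.pdb"""

matrix_text = """100  85   37   34   9    8
85   100  34   33   8    9
38   36   100  84   20   18
37   36   84   100  19   17
14   13   18   19   100  88
13   13   15   16   88   100"""

names = names_text.split()
matrix = [[int(value) for value in line.split()] for line in matrix_text.splitlines()]
index_of = {name: position for position, name in enumerate(names)}


def compact_score(row_name, column_name):
    """Look up an integer 0-100 score by file names and rescale it to 0-1."""
    return matrix[index_of[row_name]][index_of[column_name]] / 100.0


print(compact_score("6FPQ_model1a.pdb", "6FPQ_model1b.pdb"))
```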

The generated clustering data files are the same for both the default and the “extremely compact” regimes.

An example of a residue_residue_cadscore_clusters.tsv file:

name              threshold_30  threshold_40  threshold_50  threshold_60  threshold_70  threshold_80  threshold_90
6FPQ_model1a.pdb  1             1             1             1             1             1             1
6FPQ_model1b.pdb  1             1             1             1             1             1             2
6FPQ_model2a.pdb  1             2             2             2             2             2             3
6FPQ_model2b.pdb  1             2             2             2             2             2             4
6FPQ_model3a.pdb  2             3             3             3             3             3             5
6FPQ_model3b.pdb  2             3             3             3             3             3             6

An example of a residue_residue_cadscore_cluster_representatives.tsv file:

threshold_percents  cluster_id  representative    cluster_size

30                  1           6FPQ_model1a.pdb  4
30                  2           6FPQ_model3a.pdb  2

40                  1           6FPQ_model1a.pdb  2
40                  2           6FPQ_model2a.pdb  2
40                  3           6FPQ_model3a.pdb  2

50                  1           6FPQ_model1a.pdb  2
50                  2           6FPQ_model2a.pdb  2
50                  3           6FPQ_model3a.pdb  2

60                  1           6FPQ_model1a.pdb  2
60                  2           6FPQ_model2a.pdb  2
60                  3           6FPQ_model3a.pdb  2

70                  1           6FPQ_model1a.pdb  2
70                  2           6FPQ_model2a.pdb  2
70                  3           6FPQ_model3a.pdb  2

80                  1           6FPQ_model1a.pdb  2
80                  2           6FPQ_model2a.pdb  2
80                  3           6FPQ_model3a.pdb  2

90                  1           6FPQ_model1a.pdb  1
90                  2           6FPQ_model1b.pdb  1
90                  3           6FPQ_model2a.pdb  1
90                  4           6FPQ_model2b.pdb  1
90                  5           6FPQ_model3a.pdb  1
90                  6           6FPQ_model3b.pdb  1
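For readers unfamiliar with Taylor-Butina clustering, here is a minimal generic sketch of the greedy procedure: repeatedly pick the item with the most neighbors above the threshold as a cluster representative and remove its neighbors. It is not the CAD-score-LT code; in particular, treating two models as neighbors only when both directional scores reach the threshold, and breaking ties by input order, are assumptions. On the small integer matrix shown earlier, this sketch happens to give the same representatives and cluster sizes as the example files.

```python
def butina_like_clusters(similarity, threshold):
    """Greedy clustering; returns a list of (representative, members) index tuples."""

    def are_neighbors(i, j):
        # assumed rule: both directional scores must reach the threshold
        return i == j or min(similarity[i][j], similarity[j][i]) >= threshold

    unassigned = list(range(len(similarity)))
    clusters = []
    while unassigned:
        neighbors = {i: [j for j in unassigned if are_neighbors(i, j)] for i in unassigned}
        # item with the most neighbors; ties go to the earliest item
        representative = max(unassigned, key=lambda i: len(neighbors[i]))
        members = neighbors[representative]
        clusters.append((representative, members))
        unassigned = [i for i in unassigned if i not in members]
    return clusters


# integer 0-100 scores from the example matrix file
matrix = [
    [100, 85, 37, 34, 9, 8],
    [85, 100, 34, 33, 8, 9],
    [38, 36, 100, 84, 20, 18],
    [37, 36, 84, 100, 19, 17],
    [14, 13, 18, 19, 100, 88],
    [13, 13, 15, 16, 88, 100],
]

for threshold in (30, 40, 90):
    print(threshold, butina_like_clusters(matrix, threshold))
```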

Pre-processing small-molecule ligand input

For consistent naming of input small-molecule ligand atoms, and for determining small-molecule atom equivalence classes, extra data-preparation utilities for CAD-score-LT are provided in a separate repository.

The available tools are:

Using those tools is not required, but it is suggested in order to ensure that ligand atom names are handled appropriately and do not unduly penalize the atom-level CAD-score.

Equivalence classes data

The following optional argument of CAD-score-LT accepts atom equivalence information as input:

--conflation-config-file                         string     input file path for reading atom name conflation rules

The expected file format is a tab-separated table with three columns: residue name, atom name, generated equivalence class name. Such table files can be generated by the group-equivalent-ligand-atoms.py script.

For example, run the script and save the output to a file:

python3 ./group-equivalent-ligand-atoms.py ./1CNW.cif > ./table.tsv

The file table.tsv should contain a table where each row corresponds to an atom that belongs to a multi-member equivalence class within a ligand:

EG1    O1    OX4
EG1    O2    OX4
EG1    C2    CX12
EG1    C3    CX13
EG1    C5    CX13
EG1    C6    CX
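To see what such a table expresses, the sketch below loads it into a dictionary and replaces atom names with their equivalence class names, so that equivalent atoms end up with the same label. How CAD-score-LT applies the rules internally is not shown in this document; the function names and the fallback to the original atom name are assumptions. Only the first rows of the table above are used as sample data.

```python
# first rows of the example table, written with tab separators
table_text = (
    "EG1\tO1\tOX4\n"
    "EG1\tO2\tOX4\n"
    "EG1\tC3\tCX13\n"
    "EG1\tC5\tCX13\n"
)


def load_equivalence_classes(lines):
    """Map (residue name, atom name) to an equivalence class name."""
    classes = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 3:
            continue  # skip empty or malformed rows
        residue_name, atom_name, class_name = fields
        classes[(residue_name, atom_name)] = class_name
    return classes


def conflated_atom_name(residue_name, atom_name, classes):
    """Return the class name if the atom has one, otherwise the atom name itself."""
    return classes.get((residue_name, atom_name), atom_name)


classes = load_equivalence_classes(table_text.splitlines())
print(conflated_atom_name("EG1", "O1", classes), conflated_atom_name("EG1", "O2", classes))
```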

CAD-score-LT Python bindings

The CAD-score-LT Python interface PyPI package is hosted at https://pypi.org/project/cadscorelt/.

Installation

Install with pip using this command:

pip install cadscorelt

Additionally, it is recommended to have the pandas library for data analysis available in the Python environment. This allows the CAD-score result tables to be converted to pandas data frames. The pandas library can also be installed using pip:

pip install pandas

CAD-score-LT also provides integration with some common libraries for reading macromolecular files - Biotite, Gemmi, Biopython, if those libraries are available in the Python environment. They can be installed via pip:

pip install biotite
pip install gemmi
pip install biopython

Usage examples

Basic example

Below is an example script that calculates CAD-scores for inter-chain residue-residue contact areas, produces a table of global scores, converts that table to a pandas data frame, and prints the top rows of the data frame:

import cadscorelt

# init a CAD-score computation object
css = cadscorelt.CADScoreComputer.init(subselect_contacts="[-inter-chain]")

# add a target structure
css.add_target_structure_from_file("./input/data/protein_homodimer1/target.pdb")

# add model structures
css.add_model_structure_from_file("./input/data/protein_homodimer1/model1.pdb")
css.add_model_structure_from_file("./input/data/protein_homodimer1/model2.pdb")

# get a list of global scores and convert it to pandas data frame
df_global_scores_residue_residue = css.get_all_cadscores_residue_residue_summarized_globally().to_pandas()

# print the first rows of the data frame
cadscorelt.print_head_of_pandas_data_frame(df_global_scores_residue_residue)

Below is an example of the printed output:

target_name model_name  CAD_score  F1_of_areas  target_area  model_area    TP_area    FP_area    FN_area renamed_chains
     target     model2   0.621922     0.774894  1047.807935  941.514533 784.870893 193.071031 262.937041              .
     target     model1   0.507319     0.639249  1047.807935  792.834440 648.098138 331.779276 399.709796              .
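The F1_of_areas values printed above are numerically consistent with the usual F1 formula applied to the TP_area, FP_area and FN_area columns. This relation is inferred from the printed numbers rather than from a stated definition, so treat the sketch below as a plausibility check only.

```python
def f1_of_areas(tp_area, fp_area, fn_area):
    """Standard F1 = 2*TP / (2*TP + FP + FN), applied to areas instead of counts."""
    denominator = 2.0 * tp_area + fp_area + fn_area
    return 2.0 * tp_area / denominator if denominator > 0.0 else 0.0


# (TP_area, FP_area, FN_area) copied from the printed table
rows = {
    "model2": (784.870893, 193.071031, 262.937041),
    "model1": (648.098138, 331.779276, 399.709796),
}

for model_name, (tp_area, fp_area, fn_area) in rows.items():
    print(model_name, f1_of_areas(tp_area, fp_area, fn_area))
```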

Basic example using different structure readers

Below is an example script that is similar to the previous example script, but it shows how to input structures from different sources:

import cadscorelt

# init a CAD-score computation object
csc = cadscorelt.CADScoreComputer.init(subselect_contacts="[-inter-chain]")

# add a target structure read by Biotite
import biotite.structure.io
structure_target = biotite.structure.io.load_structure("./input/data/protein_homodimer1/target.pdb")
csc.add_target_structure_from_biotite(structure_target, "target")

# add a model structure read by Gemmi 
import gemmi
structure_model1 = gemmi.read_structure("./input/data/protein_homodimer1/model1.pdb")
csc.add_model_structure_from_gemmi(structure_model1[0], "model1")

# add a model structure read by Biopython 
import Bio.PDB
parser = Bio.PDB.PDBParser(QUIET=True)
structure_model2 = parser.get_structure("id", "./input/data/protein_homodimer1/model2.pdb").get_atoms()
csc.add_model_structure_from_biopython(structure_model2, "model2")

# get a table of global scores and convert it to a pandas data frame
df_global_scores_residue_residue = csc.get_all_cadscores_residue_residue_summarized_globally().to_pandas()

# print the first rows of the data frame
cadscorelt.print_head_of_pandas_data_frame(df_global_scores_residue_residue)

Below is an example of the printed output:

target_name model_name  CAD_score  F1_of_areas  target_area  model_area    TP_area    FP_area    FN_area renamed_chains
     target     model2   0.621922     0.774894  1047.808013  941.514579 784.870926 193.071041 262.937087              .
     target     model1   0.507319     0.639249  1047.808013  792.834440 648.098141 331.779274 399.709873              .
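
The scores match the output of the previous example to the printed precision, while several area values differ from about the fifth decimal place. The source does not state the cause. When comparing tables produced through different readers, a tolerance-based comparison is therefore safer than exact equality. A minimal sketch, using values for model2 copied from the two printed outputs (the tolerance of 1e-6 is an arbitrary choice for illustration):

```python
import math

# (target_area, TP_area, FN_area) for model2, copied from the two printed outputs
from_files = (1047.807935, 784.870893, 262.937041)
from_readers = (1047.808013, 784.870926, 262.937087)

# exact comparison fails because of the small differences in the areas
exactly_equal = from_files == from_readers

# comparison with a relative tolerance treats the two runs as equivalent
close_enough = all(
    math.isclose(a, b, rel_tol=1e-6) for a, b in zip(from_files, from_readers)
)

print(exactly_equal, close_enough)
```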

Advanced example

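Below is an example script that includes stricter identifier matching using residue names, atom-atom contact scoring, automatic chain remapping, reference sequences and stoichiometry, reading all structures from a directory, and retrieving tables of global and local scores: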

import cadscorelt
from pathlib import Path

# to make comparison more strict, globally enable inclusion of residue names into atom and residue identifiers
cadscorelt.enable_considering_residue_names()

# init a CAD-score computation object: enable atom-atom contact scoring, automatic chain remapping to maximize global similarity, and recording of local scores
csc = cadscorelt.CADScoreComputer.init(subselect_contacts="[-inter-chain]", score_atom_atom_contacts=True, remap_chains=True, record_local_scores=True)

# set reference sequences and stoichiometry for automatic residue renumbering and chain name assignment
csc.set_reference_sequences_from_file("./input/data/protein_heteromer1/sequences.fasta")
csc.set_reference_stoichiometry([2, 2, 2])

# input structures from all the files in a directory
input_directory = Path("./input/data/protein_heteromer1/structures")
for file_path in input_directory.iterdir():
    if file_path.is_file():
        csc.add_structure_from_file(str(file_path))

# get the table of structure descriptors and print its top rows
df_structure_descriptors = csc.get_all_structure_descriptors().to_pandas()
print("")
print(" # Table of structure decriptors:")
print("")
cadscorelt.print_head_of_pandas_data_frame(df_structure_descriptors)
print("")

# get the table of global scores based on residue-residue contacts, print top rows
df_global_scores_residue_residue = csc.get_all_cadscores_residue_residue_summarized_globally().to_pandas()
print("")
print(" # Table of globals scores based on residue-residue contacts:")
print("")
cadscorelt.print_head_of_pandas_data_frame(df_global_scores_residue_residue)
print("")

# get the table of global scores based on atom-atom contacts, print top rows
df_global_scores_atom_atom = csc.get_all_cadscores_atom_atom_summarized_globally().to_pandas()
print("")
print(" # Table of globals scores based on atom-atom contacts:")
print("")
cadscorelt.print_head_of_pandas_data_frame(df_global_scores_atom_atom)
print("")

# choose the target and model structure names used for the local score tables below
target_name = "cf_woTemplates_model_3_multimer_v3_pred_47"
model_name = "cf_woTemplates_model_2_multimer_v3_pred_26"

# get the table of per-residue scores based on residue-residue contacts, print top rows
df_local_scores_per_residue = csc.get_local_cadscores_residue_residue_summarized_per_residue(target_name, model_name).to_pandas()
print("")
print(" # Table of per-residue scores based on residue-residue contacts:")
print("")
cadscorelt.print_head_of_pandas_data_frame(df_local_scores_per_residue)
print("")

# get the table of scores for every residue-residue contact, print top rows
df_local_scores_residue_residue = csc.get_local_cadscores_residue_residue(target_name, model_name).to_pandas()
print("")
print(" # Table of scores for every residue-residue contact (CAD-score values of -1 idicate that the contact was not present in the target structure):")
print("")
cadscorelt.print_head_of_pandas_data_frame(df_local_scores_residue_residue)
print("")

# get the table of per-atom scores based on atom-atom contacts, print top rows
df_local_scores_per_atom = csc.get_local_cadscores_atom_atom_summarized_per_atom(target_name, model_name).to_pandas()
print("")
print(" # Table of per-atom scores based on atom-atom contacts (CAD-score values of -1 idicate that the atom had no relevant contacts in the target structure):")
print("")
cadscorelt.print_head_of_pandas_data_frame(df_local_scores_per_atom)
print("")

# get the table of scores for every atom-atom contact, print top rows
df_local_scores_atom_atom = csc.get_local_cadscores_atom_atom(target_name, model_name).to_pandas()
print("")
print(" # Table of scores for every atom-atom contact (CAD-score values of -1 idicate that the contact was not present in the target structure):")
print("")
cadscorelt.print_head_of_pandas_data_frame(df_local_scores_atom_atom)
print("")

Below is an example of the printed output:


 # Table of structure decriptors:

                                                    name  is_target  is_model          renamed_chains reference_alignment
                   afm_basic_model_5_multimer_v1_pred_35       True      True B=A,C=D,D=B,E=E,F=C,G=F           available
            afm_dropout_full_model_1_multimer_v2_pred_42       True      True B=A,C=D,D=B,E=E,F=C,G=F           available
            afm_dropout_full_model_2_multimer_v1_pred_65       True      True B=A,C=D,D=B,E=E,F=C,G=F           available
            afm_dropout_full_model_3_multimer_v3_pred_64       True      True B=A,C=D,D=B,E=E,F=C,G=F           available
            afm_dropout_full_model_3_multimer_v3_pred_66       True      True B=A,C=D,D=B,E=E,F=C,G=F           available
 afm_dropout_full_woTemplates_model_3_multimer_v1_pred_4       True      True B=A,C=D,D=B,E=E,F=C,G=F           available
afm_dropout_full_woTemplates_model_3_multimer_v1_pred_45       True      True B=A,C=D,D=B,E=E,F=C,G=F           available
afm_dropout_full_woTemplates_model_4_multimer_v3_pred_50       True      True B=A,C=D,D=B,E=E,F=C,G=F           available
              cf_woTemplates_model_2_multimer_v3_pred_26       True      True A=A,B=D,C=B,D=E,E=C,F=F           available
              cf_woTemplates_model_3_multimer_v3_pred_47       True      True A=A,B=D,C=B,D=E,E=C,F=F           available


 # Table of globals scores based on residue-residue contacts:

                                 target_name                                               model_name  CAD_score  F1_of_areas  target_area  model_area     TP_area     FP_area     FN_area          renamed_chains
afm_dropout_full_model_1_multimer_v2_pred_42 afm_dropout_full_woTemplates_model_4_multimer_v3_pred_50   0.847662     0.701359  3970.007175 3836.593118 3582.890995 2664.103761  387.116180 A=A;B=B;C=F;D=D;E=E;F=C
  cf_woTemplates_model_3_multimer_v3_pred_47               cf_woTemplates_model_2_multimer_v3_pred_26   0.704688     0.780891  7483.973156 6871.838277 5912.094425 1745.848929 1571.878730 A=D;B=E;C=F;D=A;E=B;F=C
  cf_woTemplates_model_2_multimer_v3_pred_26               cf_woTemplates_model_3_multimer_v3_pred_47   0.699040     0.781187  7657.943354 6855.599895 5914.330867 1569.642289 1743.612487 A=A;B=B;C=C;D=D;E=E;F=F
afm_dropout_full_model_3_multimer_v3_pred_66             afm_dropout_full_model_3_multimer_v3_pred_64   0.657842     0.675283  7313.203511 5519.154809 5124.561234 2739.763621 2188.642277 A=A;B=B;C=C;D=D;E=E;F=F
afm_dropout_full_model_3_multimer_v3_pred_64             afm_dropout_full_model_3_multimer_v3_pred_66   0.620764     0.675283  7864.324855 5420.088717 5124.561234 2188.642277 2739.763621 A=A;B=B;C=C;D=D;E=E;F=F
afm_dropout_full_model_3_multimer_v3_pred_66               cf_woTemplates_model_2_multimer_v3_pred_26   0.575097     0.645935  7313.203511 5696.007986 4835.195167 2822.748187 2478.008344 A=A;B=E;C=C;D=D;E=B;F=F
  cf_woTemplates_model_2_multimer_v3_pred_26             afm_dropout_full_model_3_multimer_v3_pred_66   0.573808     0.645935  7657.943354 5446.078054 4835.195167 2478.008344 2822.748187 A=A;B=E;C=C;D=D;E=B;F=F
afm_dropout_full_model_3_multimer_v3_pred_66               cf_woTemplates_model_3_multimer_v3_pred_47   0.566239     0.627281  7313.203511 5493.415167 4640.996876 2842.976280 2672.206635 A=D;B=B;C=F;D=A;E=E;F=C
  cf_woTemplates_model_3_multimer_v3_pred_47             afm_dropout_full_model_3_multimer_v3_pred_66   0.561418     0.627332  7483.973156 5252.699687 4641.373085 2671.830426 2842.600071 A=A;B=E;C=C;D=D;E=B;F=F
  cf_woTemplates_model_2_multimer_v3_pred_26             afm_dropout_full_model_3_multimer_v3_pred_64   0.550970     0.603414  7657.943354 5255.694013 4683.177891 3181.146964 2974.765463 A=A;B=E;C=C;D=D;E=B;F=F
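
Because every structure in this example is both a target and a model, each pair appears twice with the roles swapped. The two rows for pred_66 and pred_64 use the same chain mapping in both directions and illustrate the difference between the two scores: TP_area is the same, FP_area and FN_area trade places, F1_of_areas is unchanged, and CAD_score changes. The sketch below reproduces this from the printed values. The F1 relation used here is inferred from the printed numbers and is an assumption, and `f1_of_areas` is a hypothetical helper, not a cadscorelt function.

```python
import math

def f1_of_areas(tp_area, fp_area, fn_area):
    """Hypothetical helper: F1-like score from TP, FP and FN areas."""
    denominator = 2.0 * tp_area + fp_area + fn_area
    return 2.0 * tp_area / denominator if denominator > 0.0 else 0.0

# target pred_66, model pred_64 (values copied from the table above)
forward = {"CAD_score": 0.657842, "target_area": 7313.203511,
           "TP_area": 5124.561234, "FP_area": 2739.763621, "FN_area": 2188.642277}
# target pred_64, model pred_66 (values copied from the table above)
swapped = {"CAD_score": 0.620764, "target_area": 7864.324855,
           "TP_area": 5124.561234, "FP_area": 2188.642277, "FN_area": 2739.763621}

f1_forward = f1_of_areas(forward["TP_area"], forward["FP_area"], forward["FN_area"])
f1_swapped = f1_of_areas(swapped["TP_area"], swapped["FP_area"], swapped["FN_area"])

# F1 is symmetric under swapping FP and FN; the CAD-score values are not
print("F1 values agree:", math.isclose(f1_forward, f1_swapped))
print("CAD-score values agree:", forward["CAD_score"] == swapped["CAD_score"])
```

For pairs where chain remapping picks a different mapping in each direction, such as the rows for pred_47 and pred_26, the areas are not exact mirror images, so this symmetry holds only approximately there.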


 # Table of globals scores based on atom-atom contacts:

                                 target_name                                               model_name  CAD_score  F1_of_areas  target_area  model_area     TP_area     FP_area     FN_area          renamed_chains
afm_dropout_full_model_1_multimer_v2_pred_42 afm_dropout_full_woTemplates_model_4_multimer_v3_pred_50   0.694870     0.612790  3970.007175 3631.663716 3130.437751 3116.557004  839.569423 A=A;B=B;C=F;D=D;E=E;F=C
afm_dropout_full_model_3_multimer_v3_pred_66             afm_dropout_full_model_3_multimer_v3_pred_64   0.600300     0.630986  7313.203511 5287.148112 4788.403033 3075.921822 2524.800478 A=A;B=B;C=C;D=D;E=E;F=F
afm_dropout_full_model_3_multimer_v3_pred_64             afm_dropout_full_model_3_multimer_v3_pred_66   0.566073     0.630986  7864.324855 5222.053418 4788.403033 2524.800478 3075.921822 A=A;B=B;C=C;D=D;E=E;F=F
  cf_woTemplates_model_3_multimer_v3_pred_47               cf_woTemplates_model_2_multimer_v3_pred_26   0.559002     0.635331  7483.973156 5931.794420 4810.063036 2847.880319 2673.910120 A=D;B=E;C=F;D=A;E=B;F=C
  cf_woTemplates_model_2_multimer_v3_pred_26               cf_woTemplates_model_3_multimer_v3_pred_47   0.550378     0.635281  7657.943354 5972.073003 4809.683681 2674.289475 2848.259673 A=A;B=B;C=C;D=D;E=E;F=F
afm_dropout_full_model_3_multimer_v3_pred_66               cf_woTemplates_model_2_multimer_v3_pred_26   0.455307     0.527325  7313.203511 5000.847832 3947.330056 3710.613299 3365.873455 A=A;B=E;C=C;D=D;E=B;F=F
  cf_woTemplates_model_2_multimer_v3_pred_26             afm_dropout_full_model_3_multimer_v3_pred_66   0.446372     0.527325  7657.943354 4761.917277 3947.330056 3365.873455 3710.613299 A=A;B=E;C=C;D=D;E=B;F=F
  cf_woTemplates_model_2_multimer_v3_pred_26             afm_dropout_full_model_3_multimer_v3_pred_64   0.445539     0.506868  7657.943354 4677.771912 3933.871243 3930.453612 3724.072111 A=A;B=E;C=C;D=D;E=B;F=F
  cf_woTemplates_model_3_multimer_v3_pred_47             afm_dropout_full_model_3_multimer_v3_pred_64   0.439929     0.489985  7483.973156 4453.132335 3760.215370 4104.109485 3723.757786 A=D;B=B;C=F;D=A;E=E;F=C
afm_dropout_full_model_3_multimer_v3_pred_66               cf_woTemplates_model_3_multimer_v3_pred_47   0.439601     0.499394  7313.203511 4526.494827 3694.810365 3789.162791 3618.393146 A=D;B=B;C=F;D=A;E=E;F=C


 # Table of per-residue scores based on residue-residue contacts:

ID_chain  ID_rnum ID_icode  CAD_score  F1_of_areas  target_area  model_area   TP_area   FP_area   FN_area
       A        4        .   0.000000     0.000000     5.723360    0.000000  0.000000 14.123374  5.723360
       A        6        .   0.000000     0.504191     7.178632   14.612032  7.160650 14.065207  0.017982
       A       15        .   0.000000     0.000000     1.346514    0.000000  0.000000  2.051384  1.346514
       A       17        .   0.175261     0.298250    25.529513    4.474315  4.474315  0.000000 21.055197
       A       18        .   0.378808     0.293841    26.519188   10.045678 10.045678 31.810069 16.473510
       A       19        .   0.715554     0.783253    69.951082   79.263636 63.086289 28.050569  6.864793
       A       20        .   0.228410     0.399188    97.824322   75.742435 35.484263 44.473835 62.340058
       A       21        .   0.892938     0.923627    55.431386   54.675708 52.086259  5.268696  3.345127
       A       22        .   0.000000     0.277170     0.401577    2.496120  0.401577  2.094543  0.000000
       A       23        .   0.406606     0.631095    34.355429   20.906229 17.437673  3.468556 16.917756


 # Table of scores for every residue-residue contact (CAD-score values of -1 idicate that the contact was not present in the target structure):

ID1_chain  ID1_rnum ID1_icode ID2_chain  ID2_rnum ID2_icode  CAD_score  F1_of_areas  target_area  model_area  TP_area   FP_area   FN_area
        A         4         .         D         6         .   0.000000     0.000000     0.043450    0.000000 0.000000  0.000000  0.043450
        A         4         .         D        61         .   0.000000     0.000000     5.679910    0.000000 0.000000  0.000000  5.679910
        A         4         .         D       206         .  -1.000000     0.000000     0.000000    0.000000 0.000000 14.123374  0.000000
        A         6         .         D         4         .   0.000000     0.000000     0.017982    0.000000 0.000000  0.000000  0.017982
        A         6         .         D        65         .   0.000000     0.657765     7.160650   14.612032 7.160650  7.451382  0.000000
        A         6         .         D       208         .  -1.000000     0.000000     0.000000    0.000000 0.000000  6.613825  0.000000
        A        15         .         C        31         .  -1.000000     0.000000     0.000000    0.000000 0.000000  2.051384  0.000000
        A        15         .         D       137         .   0.000000     0.000000     1.346514    0.000000 0.000000  0.000000  1.346514
        A        17         .         C        31         .   0.086301     0.158889    16.995494    1.466721 1.466721  0.000000 15.528773
        A        17         .         C        32         .   0.557286     0.715714     5.396860    3.007595 3.007595  0.000000  2.389265
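
The rows above can be reproduced by hand from the CAD-score formula. The sketch below applies the formula to (target_area, model_area) pairs copied from this table, first to single contacts and then to all listed contacts of residue 6 in chain A, which gives the per-residue value shown earlier. The function name `cad_score` is hypothetical, and returning -1 when nothing is present in the target only mirrors the convention visible in the tables.

```python
def cad_score(contacts):
    """Apply the CAD-score formula to (target_area, model_area) pairs.

    Contacts with zero target area are outside the set G and are skipped.
    Returns -1.0 if no contact is present in the target.
    """
    in_target = [(t, m) for t, m in contacts if t > 0.0]
    total_target_area = sum(t for t, _ in in_target)
    if total_target_area <= 0.0:
        return -1.0
    # each contact contributes at most its own target area to the penalty
    penalty = sum(min(abs(t - m), t) for t, m in in_target)
    return 1.0 - penalty / total_target_area

# single contacts, values copied from the table above
score_a17_c31 = cad_score([(16.995494, 1.466721)])
score_a17_c32 = cad_score([(5.396860, 3.007595)])
score_a6_d65 = cad_score([(7.160650, 14.612032)])  # model area more than doubles

# all listed contacts of residue A 6: with D 4, D 65 and D 208
score_residue_a6 = cad_score([(0.017982, 0.0), (7.160650, 14.612032), (0.0, 0.0)])

# a contact that exists only in the model, such as A 4 with D 206
score_a4_d206 = cad_score([(0.0, 0.0)])

print(round(score_a17_c31, 6), round(score_a17_c32, 6), score_a6_d65)
print(score_residue_a6, score_a4_d206)
```

The A 6 and D 65 contact shows the effect of the min() cap in the formula: the model area exceeds the target area by more than the target area itself, so the contact scores 0 rather than a negative value.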


 # Table of per-atom scores based on atom-atom contacts (CAD-score values of -1 idicate that the atom had no relevant contacts in the target structure):

ID_chain  ID_rnum ID_icode ID_atom_name  CAD_score  F1_of_areas  target_area  model_area  TP_area   FP_area   FN_area
       A        4        .           CD  -1.000000     0.000000     0.000000    0.000000 0.000000  0.589435  0.000000
       A        4        .           CE  -1.000000     0.000000     0.000000    0.000000 0.000000  6.218582  0.000000
       A        4        .           CG  -1.000000     0.000000     0.000000    0.000000 0.000000  0.040717  0.000000
       A        4        .           NZ   0.000000     0.000000     5.723360    0.000000 0.000000  7.274641  5.723360
       A        6        .           CB  -1.000000     0.000000     0.000000    0.000000 0.000000  0.005776  0.000000
       A        6        .          CG1  -1.000000     0.000000     0.000000    0.000000 0.000000 10.924598  0.000000
       A        6        .          CG2   0.315134     0.527725     7.178632    8.956361 4.610766  5.684717  2.567866
       A       15        .           CB  -1.000000     0.000000     0.000000    0.000000 0.000000  1.960049  0.000000
       A       15        .            O   0.000000     0.000000     1.346514    0.000000 0.000000  0.091335  1.346514
       A       17        .           CB   0.057452     0.097769    25.529513    1.466721 1.466721  3.007595 24.062792


 # Table of scores for every atom-atom contact (CAD-score values of -1 idicate that the contact was not present in the target structure):

ID1_chain  ID1_rnum ID1_icode ID1_atom_name ID2_chain  ID2_rnum ID2_icode ID2_atom_name  CAD_score  F1_of_areas  target_area  model_area  TP_area  FP_area  FN_area
        A         4         .            CD         D       206         .           OE1       -1.0          0.0     0.000000         0.0      0.0 0.589435 0.000000
        A         4         .            CE         D       206         .           NE2       -1.0          0.0     0.000000         0.0      0.0 1.398847 0.000000
        A         4         .            CE         D       206         .           OE1       -1.0          0.0     0.000000         0.0      0.0 4.819735 0.000000
        A         4         .            CG         D       206         .           OE1       -1.0          0.0     0.000000         0.0      0.0 0.040717 0.000000
        A         4         .            NZ         D         6         .           CG2        0.0          0.0     0.043450         0.0      0.0 0.000000 0.043450
        A         4         .            NZ         D        61         .            CZ        0.0          0.0     0.413695         0.0      0.0 0.000000 0.413695
        A         4         .            NZ         D        61         .            OH        0.0          0.0     5.266215         0.0      0.0 0.000000 5.266215
        A         4         .            NZ         D       206         .            CD       -1.0          0.0     0.000000         0.0      0.0 0.008074 0.000000
        A         4         .            NZ         D       206         .           NE2       -1.0          0.0     0.000000         0.0      0.0 5.372747 0.000000
        A         4         .            NZ         D       206         .           OE1       -1.0          0.0     0.000000         0.0      0.0 1.893820 0.000000
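
The atom-atom rows are consistent with the coarser tables: summing them by atom reproduces the per-atom FP_area values, and summing them by residue pair reproduces the residue-residue areas. The sketch below checks this for residue 4 of chain A using the ten rows above. It assumes that these rows cover all atom-atom contacts of that residue, which the matching sums suggest but the source does not state.

```python
from collections import defaultdict

# (atom in A 4, residue number in chain D, atom in that residue, target_area, FP_area),
# copied from the ten rows above
rows = [
    ("CD", 206, "OE1", 0.000000, 0.589435),
    ("CE", 206, "NE2", 0.000000, 1.398847),
    ("CE", 206, "OE1", 0.000000, 4.819735),
    ("CG", 206, "OE1", 0.000000, 0.040717),
    ("NZ", 6, "CG2", 0.043450, 0.000000),
    ("NZ", 61, "CZ", 0.413695, 0.000000),
    ("NZ", 61, "OH", 5.266215, 0.000000),
    ("NZ", 206, "CD", 0.000000, 0.008074),
    ("NZ", 206, "NE2", 0.000000, 5.372747),
    ("NZ", 206, "OE1", 0.000000, 1.893820),
]

fp_per_atom = defaultdict(float)
target_per_residue_pair = defaultdict(float)
fp_per_residue_pair = defaultdict(float)

for atom1, rnum2, _atom2, target_area, fp_area in rows:
    fp_per_atom[atom1] += fp_area
    target_per_residue_pair[rnum2] += target_area
    fp_per_residue_pair[rnum2] += fp_area

# compare with the per-atom table (CE: 6.218582, NZ: 7.274641)
print(round(fp_per_atom["CE"], 6), round(fp_per_atom["NZ"], 6))
# compare with the residue-residue table (A 4 with D 61: 5.679910, A 4 with D 206: 14.123374)
print(round(target_per_residue_pair[61], 6), round(fp_per_residue_pair[206], 6))
```

Differences in the last printed digit are expected, because the inputs are already rounded to six decimals.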