• October 1st, 2024: we are pleased to announce the release of our new Web server: FoldScript
• October 31st, 2024: more than 1500 jobs processed in the first month. Keep spreading the word about FoldScript!
Do not hesitate to contact us (espript@ibcp.fr) if you require any further information or if you need some help with FoldScript.
Preamble
In addition to this documentation, you have access to a demonstration session by clicking on the button below. The original PDB model files for this session can be obtained separately by clicking here. They consist of a heterodimer of the IN and RT proteins of the Human Immunodeficiency Virus 1 produced by AlphaFold 3.
3D modeling programs using artificial intelligence (AI) have revolutionized structural biology by making it possible to predict protein structures with an unprecedented level of confidence, whether in monomeric or multimeric states and, for some software, in complex with nucleic acids, ligands or ions.
However, these predicted structures remain models, and divergences in predictions can be observed between distinct AI modeling algorithms (e.g. AlphaFold 2 (1), AlphaFold 3 (2), RoseTTAFold (3), ESMFold (4)), as reported in this article by Guillon et al. (5).
AlphaFold 2 and 3 generate, by default and for the same run, respectively 25 and 5 different models ranked with a confidence score. Faced with this quantity of data, the task of examining and comparing models is time-consuming and laborious. Worse, in our experience, many users tend to only retain the top-ranked model.
However, it is wise not to limit ourselves to the model with the best score and, notably, to bring previously known experimental data into the choice of the most relevant model. In practice, the "best" solution proposed by AlphaFold is not always the most reliable, and basing the choice solely on confidence scores is too simplistic. Indeed, to date, modeling programs do not consider some information that the experimenter knows (intermolecular contact zones, residues interacting with a ligand, active site, etc.).
The FoldScript server was created to respond effectively and rationally to these questions. To this end, FoldScript performs an automated and detailed analysis of the structural information of the multi-model set produced by an AlphaFold 2 or 3 run. It synthesizes, in a comparative and intelligible flat figure, the primary to quaternary structural information of the models to guide users in their decision making. In addition, this analysis can be refined by introducing previously known interaction data in order to identify the most relevant model(s).
FoldScript thus allows a large community, specialist or not in structural biology, to easily analyze datasets of 3D models generated by AI and offers rational support in finding the "real best model(s)".
FoldScript complements our ESPript and ENDscript Web servers to enable exhaustive analysis of experimental or predicted protein structures (6-7).
Overview of the FoldScript pipeline
In order to extract as much relevant information as possible about the models, FoldScript chains together several sequence and structure analysis programs:
MAXIT, an RCSB program to assist in the processing and curation of CIF/PDB structure data.
SPDB, a homemade program to extract sequence, check residue numbering and chain IDs and parse structural components from the query model files.
DSSP (8,9), to extract secondary structure elements, disulfide bridges and solvent accessibility per residue.
CNS (10), to determine protein:protein and protein:ligand contact distances, if present.
BLAST+ (11), to search protein homologues using the sequence of the query against a chosen sequence database (UniProtKB/Swiss-Prot or PDBAA).
Clustal Omega (12), to perform multiple sequence alignment on sequence hits from the previous BLAST+ search.
ESPript (6-7), to gather and render all this information with a detailed flat figure.
NGL Viewer (14), to prepare and display interactive 3D views.
The results are presented at the end of this pipeline. A graphical Web interface allows you to customize the representation produced and to obtain additional information (several types of interactive 3D visualization of the models, tables and histogram of intermolecular contacts, a tool to determine the most likely model(s) based on prior knowledge of inter- or intra-molecular contacts).
Uploading your AI models to FoldScript
FoldScript takes as input the 3D coordinate files, in PDB or CIF format, produced by AlphaFold 2 or 3 (e.g. the 25 or 5 models generated by default by AlphaFold 2 or 3, respectively).
On the upload page, drag and drop your model files onto the grey area, or click on it to browse for them on your computer:
All entry files must correspond to a modeling run conducted with the same components. They must have the same number of chains, identical sequences and the same modified residue(s) / nucleic acid(s) / ligand(s) or ion(s). Otherwise, FoldScript will produce an error message.
File names should match the following pattern: ranked_# (where '#' is a number), for example ranked_0.pdb, ranked_1.pdb, ranked_2.pdb, etc. Otherwise, the uploaded files are automatically renamed and renumbered according to this pattern (e.g. ranked_0 to ranked_24).
Monomer, homo- or hetero-multimer models are accepted, with or without modified residue(s) / nucleic acid(s) / ligand(s) / ion(s).
A maximum of 25 model files can be uploaded.
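The naming rule and the upload limit above can be checked locally before uploading. The Python sketch below is an illustration only: the function name `normalize_model_names` is hypothetical, and since the order in which FoldScript renumbers non-conforming files is not stated here, alphabetical order is assumed.

```python
import re
from pathlib import PurePath

# Pattern described above: 'ranked_' + a number, with a .pdb or .cif extension.
RANKED_RE = re.compile(r"^ranked_(\d+)\.(pdb|cif)$", re.IGNORECASE)
MAX_MODELS = 25  # upload limit stated above


def normalize_model_names(filenames):
    """Return an {original_name: new_name} mapping that follows ranked_#.

    Assumption (not documented by FoldScript): if at least one file does not
    match the pattern, all files are renumbered in alphabetical order.
    """
    if not filenames:
        raise ValueError("at least one model file is required")
    if len(filenames) > MAX_MODELS:
        raise ValueError(f"at most {MAX_MODELS} model files can be uploaded")

    names = [PurePath(name).name for name in filenames]
    if all(RANKED_RE.match(name) for name in names):
        return {name: name for name in names}

    mapping = {}
    for index, name in enumerate(sorted(names)):
        extension = PurePath(name).suffix.lower()
        mapping[name] = f"ranked_{index}{extension}"
    return mapping
```

For instance, two files named `b.cif` and `a.cif` would be mapped to `ranked_1.cif` and `ranked_0.cif` under the alphabetical-order assumption.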
Once all the files have been uploaded to the server (all progress bars colored green), click on the blue 'SUBMIT' button to launch the FoldScript analysis.
The progress of the data analysis is displayed as shown in the figure opposite. This first run generally lasts less than 30 seconds. The calculation time may be longer in the presence of a huge number of intermolecular contacts and/or of large multimeric complexes.
Stay on the page and wait, the results page will be displayed at the end of the process.
Presentation of the results
Once this calculation step is complete, the results page appears. It is divided into two parts (see screenshot below):
On the left, the FoldScript representation panel summarizing the information gathered from the analysis (see below).
On the right, a vertical toolbar for customizing the FoldScript representation or displaying additional results and tools (see "FIGURE OPTIONS toolbar" section).
Depending on your browser, you can perform some operations on the figure thanks to the top icon bar of the figure viewer: zoom-in, zoom-out, horizontal or vertical scrolling, print the figure, save it in PDF format, add annotations, etc. Please refer to your browser manual for more information on these possibilities.
The default FoldScript representation (see an excerpt below) depicts from A. (top) to E. (bottom):
A. For each uploaded model (sorted in alphanumeric order), secondary structure elements are extracted from the PDB/CIF query files. Each model is named according to the following convention: Prefix ('ranked_' by default), a number (corresponding to the rank), an '_' character followed by an uppercase letter designating the selected chain ID of the model ('A' by default).
α-, 3₁₀- and π-helices are shown as medium, small and large squiggles with α, η and π labels, respectively.
β-strands are shown as arrows labeled β.
Strict α- and β-turns are marked by TTT and TT letters, respectively.
In addition, if models were predicted with AlphaFold, FoldScript can color secondary structure elements according to the local modeling confidence score called pLDDT, which ranges from 0 to 100:
Regions with pLDDT > 90 (colored in blue) are expected to be modelled to high accuracy. They could be suitable for any application that benefits from such precision (e.g. characterizing a binding site).
Regions with pLDDT between 70 and 90 (colored in cyan) are expected to be modelled well (a generally good backbone prediction).
Regions with pLDDT between 50 and 70 (colored in yellow) are low confidence and should be treated with caution.
Regions with pLDDT < 50 (colored in orange) often have a ribbon-like appearance and should not be interpreted. These regions could also be a reasonably strong predictor of disorder.
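The color scheme above can be reproduced outside FoldScript. The sketch below assumes a PDB-format model in which pLDDT is stored in the B-factor column, as AlphaFold does. The function names are hypothetical, and the handling of the exact boundary values (70 and 50 fall in the higher class, 90 in the cyan class) is an assumption, since the text above does not specify it.

```python
def plddt_color(plddt):
    """Map a pLDDT value (0 to 100) to the color classes listed above."""
    if not 0.0 <= plddt <= 100.0:
        raise ValueError("pLDDT must lie between 0 and 100")
    if plddt > 90.0:
        return "blue"      # high accuracy expected
    if plddt >= 70.0:
        return "cyan"      # generally good backbone prediction
    if plddt >= 50.0:
        return "yellow"    # low confidence
    return "orange"        # should not be interpreted


def residue_plddt_from_pdb(pdb_text, chain_id="A"):
    """Read per-residue pLDDT from the B-factor field of CA atoms.

    Uses the fixed-column PDB ATOM layout: atom name in columns 13-16,
    chain ID in column 22, residue number in columns 23-26 and
    B-factor in columns 61-66.
    """
    scores = {}
    for line in pdb_text.splitlines():
        if not line.startswith("ATOM"):
            continue
        if line[12:16].strip() != "CA" or line[21] != chain_id:
            continue
        scores[int(line[22:26])] = float(line[60:66])
    return scores


def residue_colors(pdb_text, chain_id="A"):
    """Return {residue_number: color} for one chain of a PDB-format model."""
    scores = residue_plddt_from_pdb(pdb_text, chain_id)
    return {number: plddt_color(score) for number, score in scores.items()}
```

Only the CA atom of each residue is read here, which is sufficient when the score is identical for all atoms of a residue. Models with per-atom scores would need an averaging step.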
B. Below the representation of the secondary structure elements is a multiple sequence alignment produced by Clustal Omega, which includes the query model sequence and, by default, the top 10 homologous sequences identified by a BLAST+ search against the UniProtKB/Swiss-Prot database.
This alignment is colored according to residue conservation. A percentage of equivalent residues is calculated per column, considering physico-chemical properties, with a threshold set at 70%: HKR are polar positive, DE are polar negative, STNQ are polar neutral, AVLIM are non-polar aliphatic, FYW are non-polar aromatic.
As a result, residue letters are written in white on a red background in case of strict identity (100% similarity); in red on a yellow background if the score is in the range 70-99% (significant similarity); in black if the score is below 70% (low similarity).
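The coloring rule above can be sketched in a few lines of Python. This is not the ESPript implementation, whose exact scoring is not described here. The sketch assumes that residues outside the five listed groups (C, G, P) are only equivalent to themselves, that gaps count as non-equivalent, and that the score of a column is the share of sequences falling in its most populated group. The function names are hypothetical.

```python
from collections import Counter

# Physico-chemical groups listed above.
PROPERTY_GROUPS = {
    "polar positive": "HKR",
    "polar negative": "DE",
    "polar neutral": "STNQ",
    "non-polar aliphatic": "AVLIM",
    "non-polar aromatic": "FYW",
}
RESIDUE_TO_GROUP = {
    residue: name
    for name, letters in PROPERTY_GROUPS.items()
    for residue in letters
}
GAP_CHARS = "-."


def column_similarity(column):
    """Percentage of sequences in the most populated group of a column."""
    if not column:
        raise ValueError("empty alignment column")
    groups = Counter(
        RESIDUE_TO_GROUP.get(residue, residue)  # ungrouped residues stand alone
        for residue in column.upper()
        if residue not in GAP_CHARS
    )
    if not groups:
        return 0.0
    return 100.0 * max(groups.values()) / len(column)


def column_style(column, threshold=70.0):
    """Return the rendering style of a column as described above."""
    residues = set(column.upper())
    if len(residues) == 1 and not residues & set(GAP_CHARS):
        return "white on red"   # strict identity
    if column_similarity(column) >= threshold:
        return "red on yellow"  # significant similarity
    return "black"              # low similarity
```

Under these assumptions, a column containing K, R, K and H is fully equivalent (all polar positive) but not strictly identical, so it is rendered in red on yellow.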
C. The relative accessibility (labelled 'acc') calculated by DSSP for each residue of the top-ranked model is shown with a colored bar below sequence blocks: white is buried, cyan is intermediate, blue is accessible and blue with red borders is highly exposed.
D. Hydropathy (labelled 'hyd') calculated from the query sequence according to the algorithm of Kyte & Doolittle (15) is shown by a second colored bar below accessibility: pink is hydrophobic, grey is intermediate and cyan is hydrophilic.
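As an illustration of how such a hydropathy profile is obtained, the sketch below averages the Kyte & Doolittle hydropathy indices over a sliding window centered on each residue. The window length used by FoldScript and the thresholds separating the pink, grey and cyan classes are not given here. The default window of 9 residues and the truncation of windows at the sequence ends are assumptions, and the function name is hypothetical.

```python
# Kyte & Doolittle (1982) hydropathy index for the 20 standard amino acids.
KD_SCALE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(sequence, window=9):
    """Return one averaged hydropathy value per residue.

    The window is centered on each residue and shortened at both ends of
    the sequence (an assumption). Non-standard residue letters raise KeyError.
    """
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd number")
    residues = sequence.upper()
    half = window // 2
    profile = []
    for index in range(len(residues)):
        segment = residues[max(0, index - half):index + half + 1]
        profile.append(sum(KD_SCALE[aa] for aa in segment) / len(segment))
    return profile
```

Positive values indicate hydrophobic stretches and negative values hydrophilic ones.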
E. Protein:protein intermolecular contact distances and/or protein:ligand contact distances calculated by CNS are presented for the selected protein chain ('A' by default).
A to Z or a to z means that the concerned amino acid residue has a contact with an amino acid residue of the chain A to Z or a to z. Contact letters or symbols (see the table below) are written in red if the shortest contact distance is < 3.2 Å and in black if the shortest contact distance is in the range 3.2 - 3.7 Å. For example, in the FoldScript figure above, chain 'B' is selected ('ranked_*_B') and, for the top-ranked model (contact line labelled 'ranked_0_B'), residue W132 of chain B is in contact with a residue of chain A with a shortest contact distance < 3.2 Å (red A letter).
# identifies a contact between two amino acid residues having the same names and numbers (e.g. along 2-fold axis).
Disulfide bridges are shown by pairs of digits/letters, colored green in the case of an intramolecular bridge (1 1) or cyan in the case of an intermolecular bridge (1 1). Thus, two cysteine residues linked by a disulfide bridge will be marked with the same number.
* : " + ^ symbols mean that the concerned amino acid residue has a contact with a ligand. FoldScript only supports a set of common ligands (see table below) that are depicted by given symbols:
Hetero-compound type | Name | Symbol
Nucleotides | ADE GUA CYT THY URI A G C T U DA DG DC DT | *
Porphyrin groups | HEM BCL BPH MQ7 | :
Sugars | GLC GAL MAN NAG FUC SIA XYL | "
Ions | CA CO CU FE K MG MN N ZN CL | +
Miscellaneous | NAD NAH NDP NAP FMN | ^
Note: to date, not all ligands proposed by AlphaFold 3 are supported. Ligands not present in the table above will not be considered in the analysis performed by FoldScript.
Further information is given with colors:
A red letter / symbol identifies a contact < 3.2 Å.
A black letter / symbol identifies a contact between 3.2 Å and 3.7 Å.
A blue frame identifies an amino acid residue involved in both a protein:protein and a protein:ligand contact.
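The contact annotation rules above (chain letter or ligand symbol, red or black depending on the distance) can be summarized by the following sketch. It is not FoldScript's code: the function names are hypothetical, the treatment of a distance of exactly 3.7 Å as a contact is an assumption, and the '#' marks and disulfide bridges are left out. Ligand names are copied as they appear in the table above.

```python
# Ligand residue names per symbol, as listed in the table above.
LIGAND_SYMBOLS = {
    "*": "ADE GUA CYT THY URI A G C T U DA DG DC DT",  # nucleotides
    ":": "HEM BCL BPH MQ7",                             # porphyrin groups
    '"': "GLC GAL MAN NAG FUC SIA XYL",                 # sugars
    "+": "CA CO CU FE K MG MN N ZN CL",                 # ions
    "^": "NAD NAH NDP NAP FMN",                         # miscellaneous
}
SYMBOL_BY_LIGAND = {
    name: symbol
    for symbol, names in LIGAND_SYMBOLS.items()
    for name in names.split()
}


def contact_color(distance):
    """Return 'red' (< 3.2 A), 'black' (3.2 to 3.7 A) or None (no contact)."""
    if distance < 0:
        raise ValueError("distance must not be negative")
    if distance < 3.2:
        return "red"
    if distance <= 3.7:
        return "black"
    return None


def contact_mark(kind, name, distance):
    """Return (symbol, color) for a contact, or None if nothing is drawn.

    kind is 'protein' (name is then a chain ID) or 'ligand' (name is then a
    residue name). The kind is explicit because nucleotide names such as
    'A' or 'C' would otherwise be ambiguous with chain IDs.
    """
    color = contact_color(distance)
    if color is None:
        return None
    if kind == "protein":
        return (name, color)
    if kind == "ligand":
        symbol = SYMBOL_BY_LIGAND.get(name.upper())
        # Ligands absent from the table are ignored, as noted above.
        return (symbol, color) if symbol else None
    raise ValueError("kind must be 'protein' or 'ligand'")
```

With these rules, a residue whose shortest distance to chain A is 3.0 Å is marked with a red A, and a residue at 3.5 Å from a zinc ion is marked with a black +.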
"FIGURE OPTIONS" toolbar
On the right part of the result page, a vertical toolbar allows you to customize the FoldScript representation at your convenience with several options.
For all of them, a tooltip icon provides a help text when hovering the mouse over it.
To apply one or more option modifications, you must regenerate the figure by clicking on the blue "UPDATE FIGURE" button located at the bottom of the toolbar. A short calculation step is launched and the updated figure is then displayed.
Query rename: you can rename your query with another prefix ('ranked_' by default). You can use up to 15 characters with only alphanumeric characters plus the symbols - and _. Note that by hovering the mouse over the tooltip icon next to this option, you can view the name mapping between your uploaded model files and the names shown in the FoldScript representation.
Show / hide entries: by clicking on this button, a sliding panel appears allowing you to choose whether or not to display some of the 3D models you have uploaded. Only checked entries will be considered in the subsequent analyses and representations. At least one entry must be selected.
Model's chain used for analysis: this option is only available if the uploaded models are of "multimer" type. It allows you to choose the model's chain ID used in the subsequent analyses and representations. By default, chain A is automatically selected.
Search, align and display homologous protein sequences: if enabled, this option displays a multiple sequence alignment of proteins homologous to the query, colored according to residue conservation. To this end, a BLAST+ search is performed against the UniProtKB/Swiss-Prot (default) or the PDBAA database to find homologues. Hits are piped to Clustal Omega in order to obtain a multiple sequence alignment of the query with these homologous sequences. The resulting alignment is colored according to the degree of similarity (see "Presentation of the results" section).
This "Search, align and display homologous protein sequences" option is coupled with the following three:
Sequence database: defines the sequence database screened by the BLAST+ search. Two choices are possible: SwissProt (Swiss-Prot database from the UniProt Knowledgebase) or PDBAA (sequences derived from the PDB, the experimentally-determined 3D structures database).
E-value: sets the threshold for retaining sequence matches identified by the BLAST+ search. The E-value gives an indication of the statistical significance of a given pairwise alignment. Thus, the lower the E-value is (the closer it is to zero), the more significant the matches will be.
Maximum number of homologous sequences displayed: at the end of the BLAST+ search, only this maximum number of sequences is retained before being then aligned with Clustal Omega and finally represented. You can choose a value between 5 and 25.
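For readers who wish to reproduce a comparable search locally, the three settings above map naturally onto standard BLAST+ and Clustal Omega command-line options. The sketch below only builds the command lines. The exact invocations used by FoldScript are not published here, so the function names, file names and the default E-value are placeholders, and `swissprot` / `pdbaa` are assumed to be the names of locally installed BLAST databases.

```python
def build_blastp_command(query_fasta, database="swissprot", evalue=1e-6,
                         max_hits=10, out_path="hits.tsv"):
    """Build a blastp command line from the three options described above.

    The 5-25 bound on max_hits mirrors the range offered by the toolbar.
    The default E-value is an arbitrary placeholder.
    """
    if database not in ("swissprot", "pdbaa"):
        raise ValueError("database must be 'swissprot' or 'pdbaa'")
    if evalue <= 0:
        raise ValueError("E-value threshold must be positive")
    if not 5 <= max_hits <= 25:
        raise ValueError("max_hits must lie between 5 and 25")
    return [
        "blastp",
        "-query", query_fasta,
        "-db", database,
        "-evalue", str(evalue),
        "-max_target_seqs", str(max_hits),
        "-outfmt", "6",          # tabular output, one hit per line
        "-out", out_path,
    ]


def build_clustalo_command(in_fasta, out_alignment):
    """Build a Clustal Omega command line aligning the query and its hits."""
    return [
        "clustalo",
        "-i", in_fasta,
        "-o", out_alignment,
        "--outfmt=clu",          # Clustal-format alignment
        "--force",               # overwrite an existing output file
    ]
```

Each list can be passed to `subprocess.run(command, check=True)` on a machine where both programs are installed. Retrieving the hit sequences between the two steps is left out of this sketch.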
Display Settings: this drop-down menu allows you to display or hide several FoldScript representation elements listed below:
Display secondary structure elements: if enabled, displays secondary structure elements of each uploaded model file above the sequences block.
Secondary structure AlphaFold confidence color scheme: colors secondary structure elements according to the pLDDT confidence score. AlphaFold produces a per-residue estimate of the modelling confidence called pLDDT. If enabled, regions with pLDDT > 90 (very high model confidence) are colored in blue; regions between 70 and 90 (confident) in cyan; regions between 50 and 70 (low confidence) in yellow and regions < 50 (very low confidence) are colored in orange. If option is set to 'OFF', secondary structure elements are all colored in black.
Homology displayed only on the query sequence: by activating this option, multiple sequence alignment is no longer displayed while homology information extracted from the BLAST+/Clustal Omega process is color-coded on the models' query sequence only (see figure below).
Display relative accessibility: if enabled, displays relative accessibility calculated by the program DSSP for the top-ranked model. It is rendered as a colored bar (labelled 'acc') located below the sequences block (blue, accessible; cyan, intermediate; white, buried).
Display hydropathy: if enabled, displays hydropathy calculated according to the algorithm of Kyte & Doolittle for the query sequence. It is rendered as colored bar (labelled 'hyd') located below the sequences block (pink, hydrophobic; grey, intermediate; cyan, hydrophilic).
Display intermolecular contacts: this option is only available if your AlphaFold model is a multimer and/or contains nucleic acid(s) / ligand(s) / ion(s). It allows the display of protein:protein and/or protein:ligand contacts. They are displayed below the sequences block or, if enabled, below the bars of accessibility / hydropathy. The reference chain is the one selected with the "Model's chain used for analysis" option.
Contact ratings (chain letters, # marks, disulfide bridge labels, ligand symbols and the red / black / blue-frame color code) are depicted as described in part E of the "Presentation of the results" section.
Numbering at the beginning of each block of sequences: if enabled, all sequences are numbered at the beginning of each block. By default ('OFF' selected), the first sequence is numbered every ten residues by markers placed above it.
Add sequence markers: this drop-down form allows you to add user-supplied markers below the accessibility / hydrophobicity bars (if displayed). You can add up to 5 different sets, each defined by a symbol (5 possible choices), a color (among 10 possible) and a position. The position is given with the following syntax: 5-10 103 110-112 153 will add a marker at residues 5 to 10, 103, 110 to 112 and finally 153.
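The position syntax above (space-separated residue numbers and inclusive ranges) is simple to parse. The sketch below shows one way to expand it. The function name is hypothetical, and negative residue numbers are not handled.

```python
def parse_positions(spec):
    """Expand a string such as '5-10 103 110-112 153' into residue numbers.

    Tokens are separated by whitespace. A token is either a single number
    or an inclusive 'start-end' range. Returns a sorted list without
    duplicates.
    """
    positions = set()
    for token in spec.split():
        start_text, separator, end_text = token.partition("-")
        start = int(start_text)
        end = int(end_text) if separator else start
        if end < start:
            raise ValueError(f"invalid range: {token!r}")
        positions.update(range(start, end + 1))
    return sorted(positions)
```

Applied to the example above, the function returns residues 5 to 10, 103, 110 to 112 and 153.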
Number of columns: this option defines the number of residue columns per line in the representation, including the sequence naming area on the left. Note that this value will be automatically decreased if it results in the alignment not being able to fit entirely within the width of the page (which may be the case when using a large font size - see option below).
Font size: this option defines the size for the 'Courier' font used in the secondary structure elements / sequence representation. Three sizes are available: Small, Medium or Large.
Output picture format: in addition to the default PDF output format, you can generate figures in JPG or TIFF for presentation or publication purposes. If 'OFF' is selected, the FoldScript representation will only be available in PDF format.
Output picture resolution (in DPI): you can render the JPG or TIFF output pictures (if requested) in three resolutions (150, 300 or 600 DPI). High DPI resolutions (300 or 600) are only recommended for publication-quality figures.
"RESULT FILES & DATA VISUALIZATION" toolbar
This second toolbar, located under the "UPDATE FIGURE" button, allows you to access other tools for displaying or analyzing your models:
"Download in PDF / View fullscreen" button: it allows you, depending on your browser configuration, to download the representation generated by FoldScript in PDF format or to display it in full screen. This button can be accompanied by a second one (named either "Download in JPG" or "Download in TIFF") in the case where you have requested an output in one of these formats through the "Output picture format" option (see above).
"Display Intermolecular statistics" button: this tool is only available if the models are of "multimer" type. This button opens a new window (see screenshot below) which allows you to obtain statistics on residues involved in intermolecular contacts. This may help you to select the most likely model(s) with the constraint of previously known intermolecular contacting residues or regions. The reference chain ID is the one selected via the "Model's chain used for analysis" option.
A. On the left, a table presents, for each residue detected to be involved in an intermolecular contact, how many models contain this contact. Only selected models are considered (see "Show / hide entries" option above) while all chain IDs are taken into account.
B. On the right, a histogram graphically presents this same information. By hovering the mouse over a bar of the histogram, you can detail for a given residue the number of models where such an intermolecular contact occurs. Finally, you can zoom in on a part of the histogram by clicking and selecting the desired area with the mouse.
C. At the bottom left of the window, the "Enter residue(s) known to be involved in intermolecular contacts" tool allows you, on the basis of residues that you experimentally know to be in contact, to sort the models according to the number of satisfied contacts.
Provide these residues with the following syntax: 231 242 243 247 248 250-252 258 (this will include residues 231, 242, 243, 247, 248, 250 to 252 and finally 258 for calculation).
If you tick the "Use side-chains only" box, only contacts involving side-chain atoms will be considered. Otherwise, contacts involving backbone atoms will also be included.
D. After clicking on the blue "Submit" button, the ranking of models satisfying contact with the given residues will be presented in the bottom right result panel, from best to worst. For each model, a percentage of agreement is given as well as the list of residues involved in a contact. Finally, residues given by the user are highlighted in red on the contacts histogram.
E. Finally, the "View detailed contact tables" button opens a new browser tab presenting, for each model, a comprehensive contact table between the current chain ID and the other chains / ligands of the models (see screenshot below). Residues entered in the "Enter residue(s) known to be involved in intermolecular contacts" box are highlighted in red. The example screenshot lists the intermolecular contacts involving chain B of the ranked_6 model.
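The per-residue statistics (A and B) and the ranking of models against known contacts (C and D) described above can be sketched as follows. The input is assumed to be, for each model, the set of residues of the reference chain found in an intermolecular contact. The "percentage of agreement" is taken here as the share of user-supplied residues found in contact in a model, which is an assumption about FoldScript's definition, as is the alphabetical tie-break. The function names are hypothetical.

```python
from collections import Counter


def contact_frequency(contacts_by_model):
    """Count, for each contacting residue, how many models contain it."""
    counts = Counter()
    for residues in contacts_by_model.values():
        counts.update(set(residues))
    return dict(counts)


def rank_models(contacts_by_model, known_residues):
    """Rank models by the share of known contact residues they satisfy.

    Returns a list of (model_name, percent_agreement, satisfied_residues)
    ordered from best to worst. Ties are broken by model name.
    """
    known = set(known_residues)
    if not known:
        raise ValueError("at least one known residue is required")
    ranking = []
    for model, residues in contacts_by_model.items():
        satisfied = sorted(known & set(residues))
        percent = 100.0 * len(satisfied) / len(known)
        ranking.append((model, percent, satisfied))
    ranking.sort(key=lambda entry: (-entry[1], entry[0]))
    return ranking
```

For example, if residues 231, 242 and 243 are known to be in contact, a model in which all three are in contact is ranked first with 100% agreement, ahead of a model satisfying only two of them.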
"Display 3D models" button: by clicking this button, the FoldScript representation is replaced by the NGL viewer (14) which allows you to interactively examine all the models uploaded to FoldScript.
On the upper part of the window (A), you can select the model to display by clicking on the appropriate button.
You can interact with the 3D representation (B) using the mouse:
- Rotate: left click+drag / Translate: right click+drag / Zoom: shift+left click (or scroll wheel) / Slab: shift+scroll wheel.
- Clicking a residue or atom will zoom in and center the view on that site.
The button at the bottom left of the viewport (C) allows you to export the current 3D view to a PNG image file.
The button at the bottom right (D) allows you to center and reset the view of the 3D structure.
The 3D viewer offers several representation schemes that can be displayed by clicking on the round buttons located at the top right (E):
Color by confidence: models are represented in "cartoon" mode and colored according to the AlphaFold pLDDT score (a per-residue estimate of the model confidence). Regions with pLDDT > 90 (very high model confidence) are colored in blue; regions between 70 and 90 (confident) in cyan; regions between 50 and 70 (low confidence) in yellow and regions < 50 (very low confidence) are colored in orange. If present, nucleic acids, ligands and ions are represented in cartoon, ball & stick and sphere modes, respectively. This representation allows you to quickly distinguish areas of the model which have good prediction confidence from those which are of low or poor confidence.
Color by chains: models are represented in "cartoon" mode and colored according to their chain ID. If present, nucleic acids, ligands and ions are represented in cartoon, ball & stick and sphere modes, respectively. This representation is very useful for distinguishing chains from each other in multimeric models.
Spacefill view: models are represented in sphere mode and colored according to their chain ID. If present, nucleic acids, ligands and ions are represented in cartoon, ball & stick and sphere modes, respectively. This representation allows you to better visualize the real occupancy of the atoms and therefore the intermolecular contact zones (protein:protein and/or protein:ligand).
Highlight markers: (screenshot above) this option is only available when you have defined markers with the "Add sequence markers" option of the vertical toolbar (see "FIGURE OPTIONS toolbar" section). Models are represented in "cartoon" mode and are colored according to their chain ID. If present, nucleic acids, ligands and ions are represented in cartoon, ball & stick and sphere modes, respectively. The side-chains of all residues are presented in light grey wire. The residues which have been marked in the "Add sequence markers" tab are presented in stick and colored as defined with the sequence markers color option. This representation is recommended if you want to highlight some residues and easily locate them on each of your models.
"Display models' mosaic" button: this button displays, in a new tab of your browser, a mosaic of snapshots of all the models uploaded to FoldScript. For this purpose, the models are superimposed on the top-ranked one, taking chain A as a fixed reference, so that all the models have the same orientation in the figure. This overall view may help identify families of models that share a common fold or spatial arrangement. It lets you sort models not by their confidence score but by their overall fold, which can be an additional decision-making aid.
"Save session" button: this button allows you to download a backup file containing all the models, information, representations and settings of your current FoldScript session. This way, you can resume your analysis whenever you want from the state you left it in. Please note that every session will be permanently deleted from the server after one hour of inactivity.
To restore a session, simply go to the FoldScript upload page and click on the "Restore a session" tab. Click on the gray banner or drag and drop your session file. Once it has been uploaded, click on the "Submit" button and your session will be instantly restored in the state in which you saved it.
"Start new session" button: this button allows you to open a new FoldScript session (models upload window) in another tab of your browser and, thus, have several independent sessions in parallel.
References
1. Jumper J., et al. (2021) Nature, 596, 583-589
2. Abramson J., et al. (2024) Nature, 630, 493-500
3. Baek M., et al. (2021) Science, 10.1126/science.abj8754
4. Lin Z., et al. (2023) Science, 379, 1123-1130
5. Guillon C., Robert X. & Gouet P. (2024) Pathogens, 13, 241
6. Gouet P., Robert X. & Courcelle E. (2003) Nucleic Acids Res., 31, 3320-3323
7. Robert X. & Gouet P. (2014) Nucleic Acids Res., 42, W320-W324
8. Touw W.G., et al. (2015) Nucleic Acids Res., 43, D364-D368
9. Kabsch W. & Sander C. (1983) Biopolymers, 22, 2577-2637
10. Brünger A.T., et al. (1998) Acta Cryst., D54, 905-921
11. Camacho C., et al. (2009) BMC Bioinformatics, 10, 421
12. Sievers F. & Higgins D.G. (2018) Protein Sci., 27, 135-145
13. The PyMOL Molecular Graphics System, Schrödinger, LLC., https://www.pymol.org
14. Rose A.S., et al. (2018) Bioinformatics, 10.1093/bioinformatics/bty419
15. Kyte J. & Doolittle R.F. (1982) J. Mol. Biol., 157, 105-132