Parses various raw sources to produce input data for the descriptor calculator.
Input comes from either prosite or ioncom. To start, we need the pdb code and chain id for the identified sequences.
For prosite, these are obtained from the html extract and a manually copy-pasted pdb list. The latter provides the pdb codes, while the former allows matching between pdb code and chain id.
For ioncom, the pdb code-chain id matching is derived directly from the downloaded test dataset.
Requires compiling converge and meme-suite, both in src. Run
    make converge
in the converge folder, and
    ./configure --prefix=$PWD --with-url=http://meme-suite.org/ --enable-build-libxml2 --enable-build-libxslt
    make
    make install
in the meme-suite folder. Prebuilt binaries cannot be used for either, because absolute paths are fixed at the compile stage (this is unconfirmed for converge). meme-suite requires mpicc for parallel runs.
Workflow
Unless otherwise stated, functions come from src/preprocess.py.
- Extract seq-name and chain-ID from source extracts
  - Input:
    - Website extract files (e.g. data/input/prosite_extract.txt)
  - Output: data/internal/pname_cid_map.pkl
  - Description:
    - Place the prosite or ioncom extract in /data/input/.
    - Run parse_extract_prosite() or parse_extract_ioncom().
- (Optional) Download relevant .pdb files from the RCSB server
  - Input:
    - data/internal/pname_cid_map.pkl
    - Internet connection
  - Output:
    - Populated data/internal/pdb_files/
  - Description: Downloads the corresponding .pdb files from the RCSB server, then deletes entries in pname_cid_map whose .pdb files are not in the folder.
    - Run download_pdb()
    - Run trim_pnames_based_on_pdb()
- Create sequence .fasta file
  - Input: data/internal/pname_cid_map.pkl
  - Output: data/internal/seqs.fasta
  - Description: The motif-finding binaries require the sequences to be in a .fasta file.
    - Run create_seq()
- Filter short sequences
  - Input:
    - data/internal/seqs.fasta
    - Populated data/internal/pdb_files/
  - Output:
    - Updated data/internal/seqs.fasta
  - Description: Sequences shorter than the desired motif length (30 residues) can cause errors during the motif search, so they are dropped.
    - Run filter_seq_file()
- (Optional) Create seed sequence file for converge
  - Input: data/input/ioncom_binding_sites.txt
  - Output: data/internal/seed_seqs.fasta
  - Description: The motif-finding binary converge requires seed sequences from which it generates its initial set of motifs.
    - Place the ioncom binding-site file in /data/input/.
    - Run make() in src/make_conv_seed_seqs.py.
- Run motif-search binary to find motif positions
  - Input:
    - data/internal/seqs.fasta
    - (Optional) Populated data/internal/pdb_files/
    - (Optional) Provided motif file (e.g. data/user/input/meme.txt)
    - (Optional) data/internal/seed_seqs.fasta
  - Output: data/internal/motif_pos.pkl
  - Description: This finds the positions of the desired motif for each sequence-chain. There are three implemented ways of running this locally:
    - Motifs can be derived from scratch using meme. This generates both the motif file and the motif positions. Run find_motifs_meme()
    - Motifs can be found using a given motif file. First, put the motif file (in MEME format) in data/input/<filename>. Then, run find_motifs_mast()
    - Motifs can be derived from scratch using converge, which also provides the motif file and positions. Run make(input_fname=<filename>, num_p=<num_processors>) in src/make_conv_seed_seqs.py.

    The motif-finding process takes a while. Instructions to run it separately are in [1] below.
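The short-sequence filtering step in the workflow can be sketched roughly as below. This is a standalone illustration, not the repository's filter_seq_file(): the helper names (read_fasta, filter_short, to_fasta) and the header format in the example (pdb code, underscore, chain id) are assumptions made for the sketch. Only the 30-residue threshold comes from the description above.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequences may span several lines
    if header is not None:
        yield header, "".join(chunks)


def filter_short(records, min_len=30):
    """Keep records whose sequence has at least min_len residues."""
    return [(h, s) for h, s in records if len(s) >= min_len]


def to_fasta(records):
    """Serialise (header, sequence) pairs back to FASTA text."""
    return "".join(f">{h}\n{s}\n" for h, s in records)


# Hypothetical input: one 40-residue chain and one 6-residue chain.
example = [">1ABC_A", "M" * 40, ">2XYZ_B", "ACDEFG"]
kept = filter_short(read_fasta(example))
print(to_fasta(kept), end="")
```

Only the first record survives here, since the second is shorter than the motif length. Sequences of exactly 30 residues are kept, matching "shorter than the desired motif length" in the step description.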
Tests
- Generate reference output: /tests/src/setup_ref.py
- Visualise reference output: /tests/src/plot_ref.py
- Check against reference output: /tests/src/test_preprocessing.py
Data files
/data
- /debug: created during runtime; should be deleted at the end of a run, except when debugging.
- /input
  - ioncom_extract.txt: Raw sequence-binding_site match, for mg, in dataIonCom.zip, downloaded from https://zhanglab.ccmb.med.umich.edu/IonCom/ >> download dataset used to...
  - ioncom_binding_sites.txt: allid_reso3.0_len50_nr40.txt in dataIonCom, shows list of sequences.
  - mg_50.fasta: From uniprot, uniref50 for seqs with MG as co-factor/ligand.
  - mg_100.fasta: uniref100 for MG cofactor seqs.
  - pdb_files: Stored pdb_files. Both tests and main should call this, since downloading takes a while. Automatically downloaded from the RCSB server, via link https://files.rcsb.org/view/{1ABC}.pdb
  - prosite_extract.txt: Copy-pasted from html (inspect source code) from the prosite website (https://prosite.expasy.org/cgi-bin/pdb/pdb_structure_list.cgi?src=PS00018).
  - ref_meme.txt: Motif file for Calcium EF-hand.
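As a rough illustration of how a .pdb file might be fetched from the link pattern given for pdb_files, here is a minimal sketch using only the standard library. It is not the repository's download_pdb(): the function names, the upper-casing of the pdb code, and the skip-if-present caching are assumptions made for the example.

```python
from pathlib import Path
from urllib.request import urlretrieve

# Link pattern quoted in the pdb_files entry above.
PDB_URL = "https://files.rcsb.org/view/{code}.pdb"


def pdb_url(pdb_code):
    """Build the download link for a four-character pdb code."""
    # Assumption: codes are upper-cased, as in the {1ABC} example.
    return PDB_URL.format(code=pdb_code.upper())


def fetch_pdb(pdb_code, out_dir):
    """Download one .pdb file into out_dir unless it is already there."""
    out_path = Path(out_dir) / f"{pdb_code.upper()}.pdb"
    if out_path.exists():
        return out_path  # reuse the stored file; downloading takes a while
    out_path.parent.mkdir(parents=True, exist_ok=True)
    urlretrieve(pdb_url(pdb_code), out_path)
    return out_path


print(pdb_url("1abc"))
```

Calling fetch_pdb("1abc", "<pdb_files folder>") would need an internet connection on first use. Skipping files that already exist reflects the note that tests and main should share the stored files.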
Linter
pylint, mostly following the Google style guide, with some additional checks disabled.
To build the sequence logo for the final UI, take descr.pkl from Descriptor_Calculator and see build() in seq_logo.py. Once that is done, run
    ./src/meme_suite/meme/src/ceqlogo -i ./gxggxg_descr.txt -m motif_name -o gxggxg_logo.eps
to generate the logo figure in .eps format. Finally, convert it to .png with a white background at https://www.epsconverter.com/, then move it to Descriptor_Calculator/src/ui/static under the appropriate filename (see IndividualFigure) so that it is loaded.
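If the logo step needs to be scripted rather than typed by hand, the ceqlogo call above could be wrapped roughly as follows. This is only a sketch: the helper names ceqlogo_command and build_logo are hypothetical and not part of the repository, and the flags are exactly those in the command quoted above.

```python
import subprocess

# Binary path as given in the command above (relative to the repo root).
CEQLOGO = "./src/meme_suite/meme/src/ceqlogo"


def ceqlogo_command(descr_path, motif_name, out_path, binary=CEQLOGO):
    """Return the argument list for one ceqlogo invocation."""
    return [binary, "-i", descr_path, "-m", motif_name, "-o", out_path]


def build_logo(descr_path, motif_name, out_path):
    """Run ceqlogo; raises CalledProcessError if it exits non-zero."""
    subprocess.run(ceqlogo_command(descr_path, motif_name, out_path), check=True)


cmd = ceqlogo_command("./gxggxg_descr.txt", "motif_name", "gxggxg_logo.eps")
print(" ".join(cmd))
```

Passing the arguments as a list avoids shell quoting issues with file names. build_logo() only works once meme-suite has been compiled as described at the top of this README.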
Todo:
- Add compilation instructions for converge and meme-suite.