{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "[![Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/edikedik/eBoruta/blob/master/notebooks/sequence_determinants_tutorial.ipynb)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Finding STk vs. Tk sequence determinants\n",
    "\n",
    "In this example, we'll tackle a problem of multiple sequence alignment (MSA)\n",
    "sequences' classification.\n",
    "\n",
    "[Protein kinases](https://en.wikipedia.org/wiki/Protein_kinase) (PKs), or phosphate transferases, are regulatory enzymes ubiquitous across all major known taxae. They catalyze transferring of phosphate moieties\n",
    "from an ATP molecule to the carboxylic group of an amino acid residue. This changes physicochemical properties of a target protein, altering it's overall behavior and functionality. Thus, they act as molecular controllers, deeply\n",
    "embedded into regulating mechanisms governing complex cellular machinery.\n",
    "\n",
    "Over the course of evolution, along with substrate specificity, PKs developed a preference towards certain amino acid residues (Fig. 1). The principal group of PKs that likely appeared initially were PKs transferring phosphate to serine and threonine residues (STk). Later, another group emerged, specializing towards the tyrosine residue (Tk).\n",
    "\n",
    "| ![Phosphorylation](../fig/Phosphorylation.png \"Phosphorylation reaction\") |\n",
    "|:--:|\n",
    "| <b> Fig. 1. Phosphorylation of Ser/Thr and Tyr. Notice the dissimilarities between these two groups </b>|\n",
    "\n",
    "What could be the sequence differences enabling such a specialization? Could we find them using features selection machinery?\n",
    "\n",
    "In 1994, [Juswinder performed](https://pubmed.ncbi.nlm.nih.gov/7971947/) a conservation-based analysis of STk and Tk sequences (use [this link](https://drive.google.com/file/d/19uvEatJwQ4ZMmdTeMpFj-QM7AnDkhTiR/view?usp=share_link) for a full paper). He discovered several MSA positions that, as structural evidence suggests, exploit physicochemical differences (e.g., volume) between Ser/Thr and Tyr. For this, he used a scoring function that would spot positions with substantial but different conservation between STk and Tk sequences in the same multiple sequence alignment.\n",
    "\n",
    "---\n",
    "\n",
    "| ![PK domains alignment](../fig/juswinderA.png \"Juswinder et al. MSA (A)\") |\n",
    "|:--:|\n",
    "| <b> Fig. 2. Positions of STk ad Tk sub-alignment (note 201 and 203) </b> |\n",
    "\n",
    "---\n",
    "\n",
    "| ![PK domains alignment](../fig/juswinderB.png \"Juswinder et al. MSA (B)\") |\n",
    "|:--:|\n",
    "| <b> Fig. 2. Positions of STk and TK sub-alignment part2 (note 168 and 170) </b> |\n",
    "\n",
    "---\n",
    "\n",
    "Here, we'll attempt to recover positions highlighted by Juswinder on a larger\n",
    "MSA built from reviewed [UniProt](https://www.uniprot.org) sequences."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 1. Install dependencies"
   ]
  },
  {
   "cell_type": "code",
   "metadata": {
    "ExecuteTime": {
     "end_time": "2024-09-25T17:07:11.554452Z",
     "start_time": "2024-09-25T17:07:11.550672Z"
    }
   },
   "source": "# ! pip install lXtractor eBoruta ipywidgets xgboost",
   "outputs": [],
   "execution_count": 3
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2. Prepare data"
   ]
  },
  {
   "cell_type": "code",
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    },
    "ExecuteTime": {
     "end_time": "2024-09-25T17:07:11.574097Z",
     "start_time": "2024-09-25T17:07:11.571783Z"
    }
   },
   "source": [
    "import gzip\n",
    "import re\n",
    "from collections import Counter\n",
    "from io import StringIO, BytesIO\n",
    "from itertools import chain\n",
    "\n",
    "import lXtractor.chain as lxc\n",
    "import lXtractor.variables as lxv\n",
    "import pandas as pd\n",
    "from lXtractor.ext import PyHMMer\n",
    "from lXtractor.util import fetch_text\n",
    "from more_itertools import consume"
   ],
   "outputs": [],
   "execution_count": 4
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 2.1 Download and parse initial sequences and profile"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The UniProt link is generated from the search API.\n",
    "We are taking all reviewed proteins (Swiss-Prot database) belonging to the PK superfamily.\n",
    "\n",
    "Link to Pfam points to the _PF00069_ -- a protein kinase domain HMM profile."
   ]
  },
  {
   "cell_type": "code",
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    },
    "ExecuteTime": {
     "end_time": "2024-09-25T17:07:11.624207Z",
     "start_time": "2024-09-25T17:07:11.619638Z"
    }
   },
   "source": [
    "LINK_UNIPROT = \"https://rest.uniprot.org/uniprotkb/stream?fields=accession%2Cid%2Csequence%2Cft_domain%2Cprotein_families&format=tsv&query=%28%28family%3A%22protein%20kinase%20superfamily%22%29%29%20AND%20%28reviewed%3Atrue%29\"\n",
    "LINK_PFAM = 'https://www.ebi.ac.uk/interpro/wwwapi//entry/pfam/PF00069?annotation=hmm'"
   ],
   "outputs": [],
   "execution_count": 5
  },
  {
   "metadata": {
    "ExecuteTime": {
     "end_time": "2024-09-25T17:07:16.269311Z",
     "start_time": "2024-09-25T17:07:11.670236Z"
    }
   },
   "cell_type": "code",
   "source": "text = fetch_text(LINK_UNIPROT)",
   "outputs": [],
   "execution_count": 6
  },
  {
   "cell_type": "code",
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    },
    "ExecuteTime": {
     "end_time": "2024-09-25T17:07:16.324189Z",
     "start_time": "2024-09-25T17:07:16.290319Z"
    }
   },
   "source": "df_up = pd.read_csv(StringIO(text.decode('utf-8')), sep='\\t')",
   "outputs": [],
   "execution_count": 7
  },
  {
   "cell_type": "code",
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    },
    "ExecuteTime": {
     "end_time": "2024-09-25T17:07:16.359255Z",
     "start_time": "2024-09-25T17:07:16.352909Z"
    }
   },
   "source": [
    "df_up.head()"
   ],
   "outputs": [
    {
     "data": {
      "text/plain": [
       "        Entry   Entry Name                                           Sequence   \n",
       "0  A0A075F7E9  LERK1_ORYSI  MVALLLFPMLLQLLSPTCAQTQKNITLGSTLAPQGPASSWLSPSGD...  \\\n",
       "1  A0A078CGE6  M3KE1_BRANA  MARQMTSSQFHKSKTLDNKYMLGDEIGKGAYGRVYIGLDLENGDFV...   \n",
       "2  A0A0B4J2F2  SIK1B_HUMAN  MVIMSEFSADPAGQGQGQQKPLRVGFYDIERTLGKGNFAVVKLARH...   \n",
       "3  A0A0K3AV08   MLK1_CAEEL  MEQASVPSYVNIPPIAKTRSTSHLAPTPEHHRSVSYEDTTTASTST...   \n",
       "4  A0A0P0VIP0  LRSK7_ORYSJ  MPPRCRRLPLLFILLLAVRPLSAAAASSIAAAPASSYRRISWASNL...   \n",
       "\n",
       "                                         Domain [FT]   \n",
       "0  DOMAIN 22..149; /note=\"Bulb-type lectin\"; /evi...  \\\n",
       "1  DOMAIN 20..274; /note=\"Protein kinase\"; /evide...   \n",
       "2  DOMAIN 27..278; /note=\"Protein kinase\"; /evide...   \n",
       "3  DOMAIN 69..130; /note=\"SH3\"; /evidence=\"ECO:00...   \n",
       "4  DOMAIN 389..661; /note=\"Protein kinase\"; /evid...   \n",
       "\n",
       "                                    Protein families  \n",
       "0  Protein kinase superfamily, Ser/Thr protein ki...  \n",
       "1  Protein kinase superfamily, Ser/Thr protein ki...  \n",
       "2  Protein kinase superfamily, CAMK Ser/Thr prote...  \n",
       "3  Protein kinase superfamily, STE Ser/Thr protei...  \n",
       "4  Leguminous lectin family; Protein kinase super...  "
      ],
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Entry</th>\n",
       "      <th>Entry Name</th>\n",
       "      <th>Sequence</th>\n",
       "      <th>Domain [FT]</th>\n",
       "      <th>Protein families</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>A0A075F7E9</td>\n",
       "      <td>LERK1_ORYSI</td>\n",
       "      <td>MVALLLFPMLLQLLSPTCAQTQKNITLGSTLAPQGPASSWLSPSGD...</td>\n",
       "      <td>DOMAIN 22..149; /note=\"Bulb-type lectin\"; /evi...</td>\n",
       "      <td>Protein kinase superfamily, Ser/Thr protein ki...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>A0A078CGE6</td>\n",
       "      <td>M3KE1_BRANA</td>\n",
       "      <td>MARQMTSSQFHKSKTLDNKYMLGDEIGKGAYGRVYIGLDLENGDFV...</td>\n",
       "      <td>DOMAIN 20..274; /note=\"Protein kinase\"; /evide...</td>\n",
       "      <td>Protein kinase superfamily, Ser/Thr protein ki...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>A0A0B4J2F2</td>\n",
       "      <td>SIK1B_HUMAN</td>\n",
       "      <td>MVIMSEFSADPAGQGQGQQKPLRVGFYDIERTLGKGNFAVVKLARH...</td>\n",
       "      <td>DOMAIN 27..278; /note=\"Protein kinase\"; /evide...</td>\n",
       "      <td>Protein kinase superfamily, CAMK Ser/Thr prote...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>A0A0K3AV08</td>\n",
       "      <td>MLK1_CAEEL</td>\n",
       "      <td>MEQASVPSYVNIPPIAKTRSTSHLAPTPEHHRSVSYEDTTTASTST...</td>\n",
       "      <td>DOMAIN 69..130; /note=\"SH3\"; /evidence=\"ECO:00...</td>\n",
       "      <td>Protein kinase superfamily, STE Ser/Thr protei...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>A0A0P0VIP0</td>\n",
       "      <td>LRSK7_ORYSJ</td>\n",
       "      <td>MPPRCRRLPLLFILLLAVRPLSAAAASSIAAAPASSYRRISWASNL...</td>\n",
       "      <td>DOMAIN 389..661; /note=\"Protein kinase\"; /evid...</td>\n",
       "      <td>Leguminous lectin family; Protein kinase super...</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "execution_count": 8
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Initialize profile as `PyHMMer` object internally interfacing homonimous library.\n",
    "`PyHMMer` is an annotator class, annotating (subsetting) `ChainSequence`s using\n",
    "HMM-derived domain boundaries."
   ]
  },
  {
   "cell_type": "code",
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    },
    "ExecuteTime": {
     "end_time": "2024-09-25T17:07:17.082778Z",
     "start_time": "2024-09-25T17:07:16.407186Z"
    }
   },
   "source": [
    "prof = PyHMMer(BytesIO(gzip.decompress(\n",
    "    fetch_text(LINK_PFAM, decode=False)\n",
    ")))"
   ],
   "outputs": [],
   "execution_count": 9
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 2.2 Convert sequences\n",
    "\n",
    "Initialize `ChainSequence`s and wrap them into the `ChainList`.\n",
    "The latter simplifies working with a group of `Chain*`-type objects.\n",
    "To learn more, check [lXtractor documentation](https://lxtractor.readthedocs.io/en/latest/)."
   ]
  },
  {
   "cell_type": "code",
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    },
    "ExecuteTime": {
     "end_time": "2024-09-25T17:07:17.122978Z",
     "start_time": "2024-09-25T17:07:17.121119Z"
    }
   },
   "source": [
    "def wrap_into_chain_seq(row):\n",
    "    \"\"\"\n",
    "    :param row: DataFrame row.\n",
    "    :return: chain sequence using \"Sequence\" field to obtain a full\n",
    "        protein sequence and \"Entry Name\" for its name.\n",
    "    \"\"\"\n",
    "    fam_df = row['Protein families']\n",
    "    family = 'other'\n",
    "    if 'protein kinase family' in fam_df:\n",
    "        if 'Ser/Thr' in fam_df:\n",
    "            family = 'STk'\n",
    "        elif 'Tyr' in fam_df:\n",
    "            family = 'Tk'\n",
    "\n",
    "    return lxc.ChainSequence.from_string(\n",
    "        row['Sequence'], name=row['Entry Name'], meta={'Family': family}\n",
    "    )"
   ],
   "outputs": [],
   "execution_count": 10
  },
  {
   "cell_type": "code",
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    },
    "ExecuteTime": {
     "end_time": "2024-09-25T17:07:17.315303Z",
     "start_time": "2024-09-25T17:07:17.167270Z"
    }
   },
   "source": [
    "chains = lxc.ChainList(\n",
    "    wrap_into_chain_seq(r) for _, r in df_up.iterrows()\n",
    ")\n",
    "Counter(c.meta['Family'] for c in chains)"
   ],
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Counter({'STk': 3702, 'Tk': 551, 'other': 248})"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "execution_count": 11
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Select only those whose family was classified as STk or Tk."
   ]
  },
  {
   "cell_type": "code",
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    },
    "ExecuteTime": {
     "end_time": "2024-09-25T17:07:17.334315Z",
     "start_time": "2024-09-25T17:07:17.331867Z"
    }
   },
   "source": "chains = chains.filter(lambda x: x.meta['Family'] != 'other')",
   "outputs": [],
   "execution_count": 12
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Below, alignment-extracted boundaries are used to create a \"child\"\n",
    "subsequence for each `ChainSequence` in `chains`. We also specify acceptance\n",
    "criteria: minimum length of extracted domain, minimum coverage, i.e., how many\n",
    "HMM positions were covered by the extracted sequence, and minimum bit score\n",
    "(which is a bit higher than the profile's own threshold; just to be safe)."
   ]
  },
  {
   "cell_type": "code",
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    },
    "ExecuteTime": {
     "end_time": "2024-09-25T17:07:36.026636Z",
     "start_time": "2024-09-25T17:07:17.375487Z"
    }
   },
   "source": [
    "consume(prof.annotate(\n",
    "    chains, new_map_name='PK', min_size=150, min_cov_hmm=0.7, min_score=50\n",
    "));"
   ],
   "outputs": [],
   "execution_count": 13
  },
  {
   "cell_type": "code",
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    },
    "ExecuteTime": {
     "end_time": "2024-09-25T17:07:36.135847Z",
     "start_time": "2024-09-25T17:07:36.080569Z"
    }
   },
   "source": [
    "len(chains), len(chains.collapse_children())"
   ],
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(4253, 3911)"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "execution_count": 14
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We observe that the final number of chains with annotated domains under the\n",
    "filtering criteria specified above differs from the initial number of chains.\n",
    "UniProt obtains PK domain annotations using a different profile using `ProSite`\n",
    "software. Thus, the filtering criteria are essentially sensitivity tweaks."
   ]
  },
  {
   "cell_type": "code",
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    },
    "ExecuteTime": {
     "end_time": "2024-09-25T17:07:36.197295Z",
     "start_time": "2024-09-25T17:07:36.193486Z"
    }
   },
   "source": [
    "chains = chains.filter(lambda x: len(x.children) > 0)\n",
    "len(chains), Counter(c.meta['Family'] for c in chains)"
   ],
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(3861, Counter({'STk': 3406, 'Tk': 455}))"
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "execution_count": 15
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 2.3 Prepare encoded dataset\n",
    "\n",
    "We encode the alignment using the [ProtFP embeddings](https://lxtractor.readthedocs.io/en/latest/lXtractor.variables.html#lXtractor.variables.base.ProtFP) derived from the PCA of the AAIndex database. Thus, they encapsulate many physicochemical properties in several compact dimensions."
   ]
  },
  {
   "cell_type": "code",
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    },
    "ExecuteTime": {
     "end_time": "2024-09-25T17:07:36.245925Z",
     "start_time": "2024-09-25T17:07:36.243977Z"
    }
   },
   "source": [
    "# The number of PCA components to take.\n",
    "N_COMP = 3"
   ],
   "outputs": [],
   "execution_count": 16
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Define variables -- `N_COMP` PCA components for each profile position.\n",
    "`lXtractor` is shipped with values already and the \"calculation\" of variables\n",
    "below is simply accessing the table values for each valid residue. By default,\n",
    "`NaN` used as a placeholder for empty values (gaps in the sequence-to-profile\n",
    "alignment)."
   ]
  },
  {
   "cell_type": "code",
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    },
    "ExecuteTime": {
     "end_time": "2024-09-25T17:07:36.296641Z",
     "start_time": "2024-09-25T17:07:36.293132Z"
    }
   },
   "source": [
    "variables = list(chain.from_iterable(\n",
    "    (lxv.PFP(pos, i) for i in range(1, N_COMP + 1)) for pos in range(1, prof.hmm.M + 1)\n",
    "))\n",
    "len(variables)"
   ],
   "outputs": [
    {
     "data": {
      "text/plain": [
       "789"
      ]
     },
     "execution_count": 17,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "execution_count": 17
  },
  {
   "cell_type": "code",
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    },
    "ExecuteTime": {
     "end_time": "2024-09-25T17:08:16.462554Z",
     "start_time": "2024-09-25T17:07:36.360876Z"
    }
   },
   "source": [
    "manager = lxv.Manager(verbose=True)\n",
    "calculator = lxv.GenericCalculator()  # You may increase the number of processes here.\n",
    "domains = chains.collapse_children()\n",
    "df = manager.aggregate_from_it(\n",
    "    # `calculate` returns a generator of the calculation results.\n",
    "    # We immediately aggregate the results into a single table.\n",
    "    manager.calculate(domains, variables, calculator, map_name='PK'), \n",
    "    num_vs=len(variables)\n",
    ")"
   ],
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Accumulating calculations: 0it [00:00, ?it/s]"
      ],
      "application/vnd.jupyter.widget-view+json": {
       "version_major": 2,
       "version_minor": 0,
       "model_id": "766f99f279da484f93a23fd10d5033db"
      }
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "execution_count": 18
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Finally, incorporate labels for STk (0) and Tk (1) domain sequences."
   ]
  },
  {
   "cell_type": "code",
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    },
    "ExecuteTime": {
     "end_time": "2024-09-25T17:08:16.488974Z",
     "start_time": "2024-09-25T17:08:16.485219Z"
    }
   },
   "source": [
    "cls_map = {'STk': 0, 'Tk': 1}\n",
    "id2cls = {s.id: cls_map[s.parent.meta['Family']] for s in domains}\n",
    "df['IsTk'] = df['ObjectID'].map(id2cls)"
   ],
   "outputs": [],
   "execution_count": 19
  },
  {
   "cell_type": "code",
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    },
    "ExecuteTime": {
     "end_time": "2024-09-25T17:08:16.551265Z",
     "start_time": "2024-09-25T17:08:16.541824Z"
    }
   },
   "source": [
    "df.head()"
   ],
   "outputs": [
    {
     "data": {
      "text/plain": [
       "                            ObjectID  PFP(p=1,i=1)  PFP(p=1,i=2)   \n",
       "0  PK_1|526-791<-(LERK1_ORYSI|1-813)           NaN           NaN  \\\n",
       "1  PK_1|20-274<-(M3KE1_BRANA|1-1299)          3.14          3.59   \n",
       "2   PK_1|27-278<-(SIK1B_HUMAN|1-783)          3.14          3.59   \n",
       "3  PK_1|205-445<-(MLK1_CAEEL|1-1059)           NaN           NaN   \n",
       "4  PK_1|392-592<-(LRSK7_ORYSJ|1-695)           NaN           NaN   \n",
       "\n",
       "   PFP(p=1,i=3)  PFP(p=2,i=1)  PFP(p=2,i=2)  PFP(p=2,i=3)  PFP(p=3,i=1)   \n",
       "0           NaN           NaN           NaN           NaN           NaN  \\\n",
       "1          2.45          5.11          0.19         -1.02          5.76   \n",
       "2          2.45         -6.61          0.94         -3.04          6.58   \n",
       "3           NaN           NaN           NaN           NaN           NaN   \n",
       "4           NaN           NaN           NaN           NaN           NaN   \n",
       "\n",
       "   PFP(p=3,i=2)  PFP(p=3,i=3)  ...  PFP(p=261,i=1)  PFP(p=261,i=2)   \n",
       "0           NaN           NaN  ...             NaN             NaN  \\\n",
       "1         -1.33         -1.71  ...           -3.82           -2.31   \n",
       "2         -1.73         -2.49  ...           -2.79            6.60   \n",
       "3           NaN           NaN  ...             NaN             NaN   \n",
       "4           NaN           NaN  ...             NaN             NaN   \n",
       "\n",
       "   PFP(p=261,i=3)  PFP(p=262,i=1)  PFP(p=262,i=2)  PFP(p=262,i=3)   \n",
       "0             NaN             NaN             NaN             NaN  \\\n",
       "1            3.45            7.33            4.55            2.77   \n",
       "2            1.21            7.33            4.55            2.77   \n",
       "3             NaN             NaN             NaN             NaN   \n",
       "4             NaN             NaN             NaN             NaN   \n",
       "\n",
       "   PFP(p=263,i=1)  PFP(p=263,i=2)  PFP(p=263,i=3)  IsTk  \n",
       "0             NaN             NaN             NaN     0  \n",
       "1            6.58           -1.73           -2.49     0  \n",
       "2            5.11            0.19           -1.02     0  \n",
       "3             NaN             NaN             NaN     0  \n",
       "4             NaN             NaN             NaN     0  \n",
       "\n",
       "[5 rows x 791 columns]"
      ],
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>ObjectID</th>\n",
       "      <th>PFP(p=1,i=1)</th>\n",
       "      <th>PFP(p=1,i=2)</th>\n",
       "      <th>PFP(p=1,i=3)</th>\n",
       "      <th>PFP(p=2,i=1)</th>\n",
       "      <th>PFP(p=2,i=2)</th>\n",
       "      <th>PFP(p=2,i=3)</th>\n",
       "      <th>PFP(p=3,i=1)</th>\n",
       "      <th>PFP(p=3,i=2)</th>\n",
       "      <th>PFP(p=3,i=3)</th>\n",
       "      <th>...</th>\n",
       "      <th>PFP(p=261,i=1)</th>\n",
       "      <th>PFP(p=261,i=2)</th>\n",
       "      <th>PFP(p=261,i=3)</th>\n",
       "      <th>PFP(p=262,i=1)</th>\n",
       "      <th>PFP(p=262,i=2)</th>\n",
       "      <th>PFP(p=262,i=3)</th>\n",
       "      <th>PFP(p=263,i=1)</th>\n",
       "      <th>PFP(p=263,i=2)</th>\n",
       "      <th>PFP(p=263,i=3)</th>\n",
       "      <th>IsTk</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>PK_1|526-791&lt;-(LERK1_ORYSI|1-813)</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>...</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>PK_1|20-274&lt;-(M3KE1_BRANA|1-1299)</td>\n",
       "      <td>3.14</td>\n",
       "      <td>3.59</td>\n",
       "      <td>2.45</td>\n",
       "      <td>5.11</td>\n",
       "      <td>0.19</td>\n",
       "      <td>-1.02</td>\n",
       "      <td>5.76</td>\n",
       "      <td>-1.33</td>\n",
       "      <td>-1.71</td>\n",
       "      <td>...</td>\n",
       "      <td>-3.82</td>\n",
       "      <td>-2.31</td>\n",
       "      <td>3.45</td>\n",
       "      <td>7.33</td>\n",
       "      <td>4.55</td>\n",
       "      <td>2.77</td>\n",
       "      <td>6.58</td>\n",
       "      <td>-1.73</td>\n",
       "      <td>-2.49</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>PK_1|27-278&lt;-(SIK1B_HUMAN|1-783)</td>\n",
       "      <td>3.14</td>\n",
       "      <td>3.59</td>\n",
       "      <td>2.45</td>\n",
       "      <td>-6.61</td>\n",
       "      <td>0.94</td>\n",
       "      <td>-3.04</td>\n",
       "      <td>6.58</td>\n",
       "      <td>-1.73</td>\n",
       "      <td>-2.49</td>\n",
       "      <td>...</td>\n",
       "      <td>-2.79</td>\n",
       "      <td>6.60</td>\n",
       "      <td>1.21</td>\n",
       "      <td>7.33</td>\n",
       "      <td>4.55</td>\n",
       "      <td>2.77</td>\n",
       "      <td>5.11</td>\n",
       "      <td>0.19</td>\n",
       "      <td>-1.02</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>PK_1|205-445&lt;-(MLK1_CAEEL|1-1059)</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>...</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>PK_1|392-592&lt;-(LRSK7_ORYSJ|1-695)</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>...</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>5 rows × 791 columns</p>\n",
       "</div>"
      ]
     },
     "execution_count": 20,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "execution_count": 20
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 3. Run feature selection"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 3.1 Setup\n",
    "\n",
    "For convenience, we'll define several helper functions and a `Dataset` dataclass\n",
    "holding the full dataset."
   ]
  },
  {
   "cell_type": "code",
   "metadata": {
    "ExecuteTime": {
     "end_time": "2024-09-25T17:08:17.115756Z",
     "start_time": "2024-09-25T17:08:16.709368Z"
    }
   },
   "source": [
    "from dataclasses import dataclass\n",
    "\n",
    "import numpy as np\n",
    "from eBoruta import eBoruta\n",
    "from sklearn.metrics import f1_score\n",
    "from sklearn.model_selection import StratifiedShuffleSplit\n",
    "from xgboost import XGBClassifier"
   ],
   "outputs": [],
   "execution_count": 21
  },
  {
   "cell_type": "code",
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    },
    "ExecuteTime": {
     "end_time": "2024-09-25T17:08:17.317801Z",
     "start_time": "2024-09-25T17:08:17.314517Z"
    }
   },
   "source": [
    "@dataclass\n",
    "class Dataset:\n",
    "    df: pd.DataFrame\n",
    "    x_names: list[str]\n",
    "    y_name: str\n",
    "\n",
    "    @property\n",
    "    def x(self) -> pd.DataFrame:\n",
    "        return self.df[self.x_names]\n",
    "\n",
    "    @property\n",
    "    def y(self) -> pd.Series:\n",
    "        return self.df[self.y_name]\n",
    "\n",
    "\n",
    "def sel_by_idx(ds, idx):\n",
    "    \"\"\"\n",
    "    :param ds: Dataset.\n",
    "    :param idx: Array of numeric indices.\n",
    "    :return: New `Dataset` instance with the same variables\n",
    "        and rows selected by `idx`.\n",
    "    \"\"\"\n",
    "    return Dataset(ds.df.iloc[idx], ds.x_names, ds.y_name)\n",
    "\n",
    "\n",
    "def score(ds, model, score_fn=f1_score):\n",
    "    \"\"\"\n",
    "    F1-score for the classifier model.\n",
    "    \"\"\"\n",
    "    return score_fn(ds.y.values, model.predict(ds.x))\n",
    "\n",
    "\n",
    "def cv(ds, model, n=10, score_fn=f1_score, agg_fn=np.mean):\n",
    "    \"\"\"\n",
    "    Cross validate the model returning the aggregated score.\n",
    "    \"\"\"\n",
    "    scores = []\n",
    "    splitter = StratifiedShuffleSplit(n_splits=n)\n",
    "    for train_idx, test_idx in splitter.split(ds.x, ds.y):\n",
    "        ds_train = sel_by_idx(ds, train_idx)\n",
    "        ds_test = sel_by_idx(ds, test_idx)\n",
    "        _model = model.__class__(**model.get_params())\n",
    "        _model.fit(ds_train.x, ds_train.y)\n",
    "        scores.append(score(ds_test, _model, score_fn))\n",
    "    return agg_fn(scores)\n",
    "\n",
    "# def plot_imp_history(df_history: pd.DataFrame):\n",
    "#     sns.lineplot(x='Step', y='Importance', hue='Feature', data=df_history)\n",
    "#     sns.lineplot(x='Step', y='Threshold', data=df_history, linestyle='--', linewidth=4)"
   ],
   "outputs": [],
   "execution_count": 22
  },
  {
   "cell_type": "code",
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    },
    "ExecuteTime": {
     "end_time": "2024-09-25T17:08:17.364798Z",
     "start_time": "2024-09-25T17:08:17.361436Z"
    }
   },
   "source": [
    "# How many top features to review?\n",
    "TOP_N = 15"
   ],
   "outputs": [],
   "execution_count": 23
  },
  {
   "cell_type": "code",
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    },
    "ExecuteTime": {
     "end_time": "2024-09-25T17:08:17.576489Z",
     "start_time": "2024-09-25T17:08:17.574244Z"
    }
   },
   "source": [
    "dataset = Dataset(df, [c for c in df.columns if 'PFP' in c], 'IsTk')\n",
    "classifier = XGBClassifier(n_jobs=-1)"
   ],
   "outputs": [],
   "execution_count": 24
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 3.2 Cross-validate the initial model\n",
    "\n",
    "Before attempting anything, let's just first cross-validate the model on the\n",
    "full dataset. We should obtain a decent result by itself. It's good practice\n",
    "to test whether subsequent feature selection deteriorates this baseline result."
   ]
  },
  {
   "cell_type": "code",
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    },
    "ExecuteTime": {
     "end_time": "2024-09-25T17:08:51.231036Z",
     "start_time": "2024-09-25T17:08:17.658618Z"
    }
   },
   "source": [
    "cv(dataset, classifier)"
   ],
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.9829271066908001"
      ]
     },
     "execution_count": 25,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "execution_count": 25
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 3.3 Select Features\n",
    "\n",
    "Run the feature selection itself. We'll keep the default parameters and a large\n",
    "number of iterations to obtain final decisions for each feature. Then we'll rank\n",
    "features and select top N features to analyze further."
   ]
  },
  {
   "cell_type": "code",
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    },
    "ExecuteTime": {
     "end_time": "2024-09-25T17:13:18.234660Z",
     "start_time": "2024-09-25T17:08:51.325216Z"
    }
   },
   "source": [
    "boruta = eBoruta(n_iter=100, shap_approximate=False, shap_tree=True)\n",
    "boruta.fit(dataset.x, dataset.y, model_type=classifier, verbose=0)"
   ],
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Boruta trials:   0%|          | 0/100 [00:00<?, ?it/s]"
      ],
      "application/vnd.jupyter.widget-view+json": {
       "version_major": 2,
       "version_minor": 0,
       "model_id": "138876c1af0447eda3ac0ac971feabcb"
      }
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "ntree_limit is deprecated, use `iteration_range` or model slicing instead.\n",
      "ntree_limit is deprecated, use `iteration_range` or model slicing instead.\n",
      "ntree_limit is deprecated, use `iteration_range` or model slicing instead.\n",
      "ntree_limit is deprecated, use `iteration_range` or model slicing instead.\n",
      "ntree_limit is deprecated, use `iteration_range` or model slicing instead.\n",
      "ntree_limit is deprecated, use `iteration_range` or model slicing instead.\n",
      "ntree_limit is deprecated, use `iteration_range` or model slicing instead.\n",
      "ntree_limit is deprecated, use `iteration_range` or model slicing instead.\n",
      "ntree_limit is deprecated, use `iteration_range` or model slicing instead.\n",
      "ntree_limit is deprecated, use `iteration_range` or model slicing instead.\n",
      "ntree_limit is deprecated, use `iteration_range` or model slicing instead.\n",
      "ntree_limit is deprecated, use `iteration_range` or model slicing instead.\n",
      "ntree_limit is deprecated, use `iteration_range` or model slicing instead.\n",
      "ntree_limit is deprecated, use `iteration_range` or model slicing instead.\n",
      "ntree_limit is deprecated, use `iteration_range` or model slicing instead.\n",
      "ntree_limit is deprecated, use `iteration_range` or model slicing instead.\n",
      "ntree_limit is deprecated, use `iteration_range` or model slicing instead.\n",
      "ntree_limit is deprecated, use `iteration_range` or model slicing instead.\n",
      "ntree_limit is deprecated, use `iteration_range` or model slicing instead.\n",
      "ntree_limit is deprecated, use `iteration_range` or model slicing instead.\n",
      "ntree_limit is deprecated, use `iteration_range` or model slicing instead.\n",
      "ntree_limit is deprecated, use `iteration_range` or model slicing instead.\n",
      "ntree_limit is deprecated, use `iteration_range` or model slicing instead.\n",
      "ntree_limit is deprecated, use `iteration_range` or model slicing instead.\n",
      "ntree_limit is deprecated, use `iteration_range` or model slicing instead.\n",
      "ntree_limit is deprecated, use `iteration_range` or model slicing instead.\n",
      "ntree_limit is deprecated, use `iteration_range` or model slicing instead.\n",
      "ntree_limit is deprecated, use `iteration_range` or model slicing instead.\n",
      "ntree_limit is deprecated, use `iteration_range` or model slicing instead.\n",
      "ntree_limit is deprecated, use `iteration_range` or model slicing instead.\n",
      "ntree_limit is deprecated, use `iteration_range` or model slicing instead.\n",
      "ntree_limit is deprecated, use `iteration_range` or model slicing instead.\n",
      "ntree_limit is deprecated, use `iteration_range` or model slicing instead.\n",
      "ntree_limit is deprecated, use `iteration_range` or model slicing instead.\n",
      "ntree_limit is deprecated, use `iteration_range` or model slicing instead.\n",
      "ntree_limit is deprecated, use `iteration_range` or model slicing instead.\n",
      "ntree_limit is deprecated, use `iteration_range` or model slicing instead.\n",
      "ntree_limit is deprecated, use `iteration_range` or model slicing instead.\n",
      "ntree_limit is deprecated, use `iteration_range` or model slicing instead.\n",
      "ntree_limit is deprecated, use `iteration_range` or model slicing instead.\n",
      "ntree_limit is deprecated, use `iteration_range` or model slicing instead.\n",
      "ntree_limit is deprecated, use `iteration_range` or model slicing instead.\n",
      "ntree_limit is deprecated, use `iteration_range` or model slicing instead.\n",
      "ntree_limit is deprecated, use `iteration_range` or model slicing instead.\n",
      "ntree_limit is deprecated, use `iteration_range` or model slicing instead.\n",
      "ntree_limit is deprecated, use `iteration_range` or model slicing instead.\n",
      "ntree_limit is deprecated, use `iteration_range` or model slicing instead.\n",
      "ntree_limit is deprecated, use `iteration_range` or model slicing instead.\n",
      "ntree_limit is deprecated, use `iteration_range` or model slicing instead.\n",
      "ntree_limit is deprecated, use `iteration_range` or model slicing instead.\n",
      "ntree_limit is deprecated, use `iteration_range` or model slicing instead.\n",
      "ntree_limit is deprecated, use `iteration_range` or model slicing instead.\n",
      "ntree_limit is deprecated, use `iteration_range` or model slicing instead.\n",
      "ntree_limit is deprecated, use `iteration_range` or model slicing instead.\n",
      "ntree_limit is deprecated, use `iteration_range` or model slicing instead.\n",
      "ntree_limit is deprecated, use `iteration_range` or model slicing instead.\n",
      "ntree_limit is deprecated, use `iteration_range` or model slicing instead.\n",
      "ntree_limit is deprecated, use `iteration_range` or model slicing instead.\n",
      "ntree_limit is deprecated, use `iteration_range` or model slicing instead.\n",
      "ntree_limit is deprecated, use `iteration_range` or model slicing instead.\n",
      "ntree_limit is deprecated, use `iteration_range` or model slicing instead.\n",
      "ntree_limit is deprecated, use `iteration_range` or model slicing instead.\n",
      "ntree_limit is deprecated, use `iteration_range` or model slicing instead.\n",
      "ntree_limit is deprecated, use `iteration_range` or model slicing instead.\n",
      "ntree_limit is deprecated, use `iteration_range` or model slicing instead.\n",
      "ntree_limit is deprecated, use `iteration_range` or model slicing instead.\n",
      "ntree_limit is deprecated, use `iteration_range` or model slicing instead.\n",
      "ntree_limit is deprecated, use `iteration_range` or model slicing instead.\n",
      "ntree_limit is deprecated, use `iteration_range` or model slicing instead.\n",
      "ntree_limit is deprecated, use `iteration_range` or model slicing instead.\n",
      "ntree_limit is deprecated, use `iteration_range` or model slicing instead.\n",
      "ntree_limit is deprecated, use `iteration_range` or model slicing instead.\n",
      "ntree_limit is deprecated, use `iteration_range` or model slicing instead.\n",
      "ntree_limit is deprecated, use `iteration_range` or model slicing instead.\n",
      "ntree_limit is deprecated, use `iteration_range` or model slicing instead.\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "eBoruta(n_iter=100)"
      ],
      "text/html": [
       "<style>#sk-container-id-1 {color: black;background-color: white;}#sk-container-id-1 pre{padding: 0;}#sk-container-id-1 div.sk-toggleable {background-color: white;}#sk-container-id-1 label.sk-toggleable__label {cursor: pointer;display: block;width: 100%;margin-bottom: 0;padding: 0.3em;box-sizing: border-box;text-align: center;}#sk-container-id-1 label.sk-toggleable__label-arrow:before {content: \"▸\";float: left;margin-right: 0.25em;color: #696969;}#sk-container-id-1 label.sk-toggleable__label-arrow:hover:before {color: black;}#sk-container-id-1 div.sk-estimator:hover label.sk-toggleable__label-arrow:before {color: black;}#sk-container-id-1 div.sk-toggleable__content {max-height: 0;max-width: 0;overflow: hidden;text-align: left;background-color: #f0f8ff;}#sk-container-id-1 div.sk-toggleable__content pre {margin: 0.2em;color: black;border-radius: 0.25em;background-color: #f0f8ff;}#sk-container-id-1 input.sk-toggleable__control:checked~div.sk-toggleable__content {max-height: 200px;max-width: 100%;overflow: auto;}#sk-container-id-1 input.sk-toggleable__control:checked~label.sk-toggleable__label-arrow:before {content: \"▾\";}#sk-container-id-1 div.sk-estimator input.sk-toggleable__control:checked~label.sk-toggleable__label {background-color: #d4ebff;}#sk-container-id-1 div.sk-label input.sk-toggleable__control:checked~label.sk-toggleable__label {background-color: #d4ebff;}#sk-container-id-1 input.sk-hidden--visually {border: 0;clip: rect(1px 1px 1px 1px);clip: rect(1px, 1px, 1px, 1px);height: 1px;margin: -1px;overflow: hidden;padding: 0;position: absolute;width: 1px;}#sk-container-id-1 div.sk-estimator {font-family: monospace;background-color: #f0f8ff;border: 1px dotted black;border-radius: 0.25em;box-sizing: border-box;margin-bottom: 0.5em;}#sk-container-id-1 div.sk-estimator:hover {background-color: #d4ebff;}#sk-container-id-1 div.sk-parallel-item::after {content: \"\";width: 100%;border-bottom: 1px solid gray;flex-grow: 1;}#sk-container-id-1 div.sk-label:hover label.sk-toggleable__label {background-color: #d4ebff;}#sk-container-id-1 div.sk-serial::before {content: \"\";position: absolute;border-left: 1px solid gray;box-sizing: border-box;top: 0;bottom: 0;left: 50%;z-index: 0;}#sk-container-id-1 div.sk-serial {display: flex;flex-direction: column;align-items: center;background-color: white;padding-right: 0.2em;padding-left: 0.2em;position: relative;}#sk-container-id-1 div.sk-item {position: relative;z-index: 1;}#sk-container-id-1 div.sk-parallel {display: flex;align-items: stretch;justify-content: center;background-color: white;position: relative;}#sk-container-id-1 div.sk-item::before, #sk-container-id-1 div.sk-parallel-item::before {content: \"\";position: absolute;border-left: 1px solid gray;box-sizing: border-box;top: 0;bottom: 0;left: 50%;z-index: -1;}#sk-container-id-1 div.sk-parallel-item {display: flex;flex-direction: column;z-index: 1;position: relative;background-color: white;}#sk-container-id-1 div.sk-parallel-item:first-child::after {align-self: flex-end;width: 50%;}#sk-container-id-1 div.sk-parallel-item:last-child::after {align-self: flex-start;width: 50%;}#sk-container-id-1 div.sk-parallel-item:only-child::after {width: 0;}#sk-container-id-1 div.sk-dashed-wrapped {border: 1px dashed gray;margin: 0 0.4em 0.5em 0.4em;box-sizing: border-box;padding-bottom: 0.4em;background-color: white;}#sk-container-id-1 div.sk-label label {font-family: monospace;font-weight: bold;display: inline-block;line-height: 1.2em;}#sk-container-id-1 div.sk-label-container {text-align: center;}#sk-container-id-1 div.sk-container {/* jupyter's `normalize.less` sets `[hidden] { display: none; }` but bootstrap.min.css set `[hidden] { display: none !important; }` so we also need the `!important` here to be able to override the default hidden behavior on the sphinx rendered scikit-learn.org. See: https://github.com/scikit-learn/scikit-learn/issues/21755 */display: inline-block !important;position: relative;}#sk-container-id-1 div.sk-text-repr-fallback {display: none;}</style><div id=\"sk-container-id-1\" class=\"sk-top-container\"><div class=\"sk-text-repr-fallback\"><pre>eBoruta(n_iter=100)</pre><b>In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook. <br />On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.</b></div><div class=\"sk-container\" hidden><div class=\"sk-item\"><div class=\"sk-estimator sk-toggleable\"><input class=\"sk-toggleable__control sk-hidden--visually\" id=\"sk-estimator-id-1\" type=\"checkbox\" checked><label for=\"sk-estimator-id-1\" class=\"sk-toggleable__label sk-toggleable__label-arrow\">eBoruta</label><div class=\"sk-toggleable__content\"><pre>eBoruta(n_iter=100)</pre></div></div></div></div></div>"
      ]
     },
     "execution_count": 26,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "execution_count": 26
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Next, we'll rank the accepted features. Ranking means refitting the model using\n",
    "supplied (here, accepted) features, calculating and sorting by feature importance\n",
    "values."
   ]
  },
  {
   "cell_type": "code",
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    },
    "ExecuteTime": {
     "end_time": "2024-09-25T17:13:25.482122Z",
     "start_time": "2024-09-25T17:13:18.264094Z"
    }
   },
   "source": [
    "ranks = boruta.rank(features=list(boruta.features_.accepted), fit=True)\n",
    "ranks.sort_values('Importance', ascending=False).head(TOP_N)"
   ],
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "ntree_limit is deprecated, use `iteration_range` or model slicing instead.\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "           Feature  Importance\n",
       "6   PFP(p=125,i=1)    2.644531\n",
       "22  PFP(p=244,i=2)    0.712439\n",
       "14  PFP(p=162,i=3)    0.476352\n",
       "1    PFP(p=33,i=2)    0.411718\n",
       "11  PFP(p=159,i=3)    0.401334\n",
       "7   PFP(p=125,i=2)    0.387498\n",
       "5    PFP(p=92,i=2)    0.365375\n",
       "9   PFP(p=127,i=3)    0.345326\n",
       "0    PFP(p=25,i=1)    0.343269\n",
       "17  PFP(p=168,i=2)    0.307936\n",
       "20  PFP(p=200,i=2)    0.271046\n",
       "3    PFP(p=46,i=3)    0.270913\n",
       "21  PFP(p=215,i=1)    0.267759\n",
       "10  PFP(p=157,i=2)    0.252067\n",
       "16  PFP(p=168,i=1)    0.243983"
      ],
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Feature</th>\n",
       "      <th>Importance</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>PFP(p=125,i=1)</td>\n",
       "      <td>2.644531</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>22</th>\n",
       "      <td>PFP(p=244,i=2)</td>\n",
       "      <td>0.712439</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>14</th>\n",
       "      <td>PFP(p=162,i=3)</td>\n",
       "      <td>0.476352</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>PFP(p=33,i=2)</td>\n",
       "      <td>0.411718</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>11</th>\n",
       "      <td>PFP(p=159,i=3)</td>\n",
       "      <td>0.401334</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>PFP(p=125,i=2)</td>\n",
       "      <td>0.387498</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>PFP(p=92,i=2)</td>\n",
       "      <td>0.365375</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>PFP(p=127,i=3)</td>\n",
       "      <td>0.345326</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>PFP(p=25,i=1)</td>\n",
       "      <td>0.343269</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>17</th>\n",
       "      <td>PFP(p=168,i=2)</td>\n",
       "      <td>0.307936</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>20</th>\n",
       "      <td>PFP(p=200,i=2)</td>\n",
       "      <td>0.271046</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>PFP(p=46,i=3)</td>\n",
       "      <td>0.270913</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>21</th>\n",
       "      <td>PFP(p=215,i=1)</td>\n",
       "      <td>0.267759</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>10</th>\n",
       "      <td>PFP(p=157,i=2)</td>\n",
       "      <td>0.252067</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>16</th>\n",
       "      <td>PFP(p=168,i=1)</td>\n",
       "      <td>0.243983</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ]
     },
     "execution_count": 27,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "execution_count": 27
  },
  {
   "cell_type": "code",
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    },
    "ExecuteTime": {
     "end_time": "2024-09-25T17:13:25.524986Z",
     "start_time": "2024-09-25T17:13:25.522326Z"
    }
   },
   "source": [
    "top_n_features = ranks.Feature[:TOP_N].tolist()"
   ],
   "outputs": [],
   "execution_count": 28
  },
  {
   "cell_type": "code",
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    },
    "ExecuteTime": {
     "end_time": "2024-09-25T17:13:25.593906Z",
     "start_time": "2024-09-25T17:13:25.591309Z"
    }
   },
   "source": [
    "ds_sel = Dataset(\n",
    "    df[['ObjectID', *top_n_features, dataset.y_name]], top_n_features, dataset.y_name\n",
    ")"
   ],
   "outputs": [],
   "execution_count": 29
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 3.4 CV score using selected features"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Finally, we'll use the dataset with accepted features to cross-validate the model.\n",
    "Indeed, the CV score is very similar to the previously obtained on a full dataset."
   ]
  },
  {
   "cell_type": "code",
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    },
    "ExecuteTime": {
     "end_time": "2024-09-25T17:14:32.240051Z",
     "start_time": "2024-09-25T17:13:25.636394Z"
    }
   },
   "source": [
    "cv(ds_sel, classifier)"
   ],
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.9787877337475468"
      ]
     },
     "execution_count": 30,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "execution_count": 30
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 4. Interpret the results\n",
    "\n",
    "Here, we'll check correspondence between MSA positions of Juswinder and those\n",
    "produced by feature selection. We'll do this based on  conservation patterns'\n",
    "similarity at selected positions and those depicted in Figures 2 and 3.\n",
    "\n",
    "Note that we conclude based on a quick visual inspection rather than building\n",
    "a full MSA and relating different positions. Still, one may verify externally\n",
    "that positions from the paper -- 168, 170, 201, 203 -- correspond to HMM profile\n",
    "positions 125, 127, 160, and 162."
   ]
  },
  {
   "cell_type": "code",
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    },
    "ExecuteTime": {
     "end_time": "2024-09-25T17:14:32.283927Z",
     "start_time": "2024-09-25T17:14:32.280964Z"
    }
   },
   "source": [
    "POS_PATTERN = re.compile('p=(\\d+)')"
   ],
   "outputs": [],
   "execution_count": 31
  },
  {
   "cell_type": "code",
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    },
    "ExecuteTime": {
     "end_time": "2024-09-25T17:14:32.334916Z",
     "start_time": "2024-09-25T17:14:32.329160Z"
    }
   },
   "source": [
    "ds_sample = ds_sel.df.groupby('IsTk').sample(30)\n",
    "positions = sorted(\n",
    "    {int(POS_PATTERN.findall(c)[0]) for c in ds_sample.columns if 'p=' in c}\n",
    ")\n",
    "positions"
   ],
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[25, 33, 42, 46, 72, 92, 125, 127, 157, 159, 160, 161, 162]"
      ]
     },
     "execution_count": 32,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "execution_count": 32
  },
  {
   "cell_type": "code",
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    },
    "ExecuteTime": {
     "end_time": "2024-09-25T17:14:32.382699Z",
     "start_time": "2024-09-25T17:14:32.379834Z"
    }
   },
   "source": [
    "def extract_positions(c: lxc.ChainSequence) -> str:\n",
    "    m = c.get_map('PK')\n",
    "    mapped = map(\n",
    "        lambda x: x if x == '-' else x.seq1,\n",
    "        (m.get(p, '-') for p in positions))\n",
    "    return \"\".join(mapped)"
   ],
   "outputs": [],
   "execution_count": 33
  },
  {
   "cell_type": "code",
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    },
    "ExecuteTime": {
     "end_time": "2024-09-25T17:14:32.507308Z",
     "start_time": "2024-09-25T17:14:32.428471Z"
    }
   },
   "source": [
    "sample_chains = [domains[x].pop() for x in ds_sample['ObjectID']]"
   ],
   "outputs": [],
   "execution_count": 34
  },
  {
   "cell_type": "code",
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    },
    "ExecuteTime": {
     "end_time": "2024-09-25T17:14:32.552343Z",
     "start_time": "2024-09-25T17:14:32.550432Z"
    }
   },
   "source": [
    "positions"
   ],
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[25, 33, 42, 46, 72, 92, 125, 127, 157, 159, 160, 161, 162]"
      ]
     },
     "execution_count": 35,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "execution_count": 35
  },
  {
   "cell_type": "code",
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    },
    "ExecuteTime": {
     "end_time": "2024-09-25T17:14:32.604885Z",
     "start_time": "2024-09-25T17:14:32.597161Z"
    }
   },
   "source": [
    "seqs = map(extract_positions, sample_chains)\n",
    "\n",
    "for is_tk, seq in zip(ds_sample.IsTk, seqs):\n",
    "    print(seq, is_tk)"
   ],
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "ER-CVNKEFGTLQ 0\n",
      "QDLRYISSMGTPN 0\n",
      "RTEELSKTVSALG 0\n",
      "RSTPGKKEFGTRV 0\n",
      "KRFITPKQEVTLW 0\n",
      "KKKRKRKEFGTIE 0\n",
      "KSAYD-KSYVTRY 0\n",
      "EDPKKQKECGSLA 0\n",
      "QSEVEAKSVGTYG 0\n",
      "GELKHKKKLLSHD 0\n",
      "---QRHTDERNLH 0\n",
      "VRPIRSKPEVTLW 0\n",
      "DQ-VITKSFGTYT 0\n",
      "VEDMKDKGFGTPY 0\n",
      "RSEIDEKEFGTTE 0\n",
      "TRELNNKGLGTLH 0\n",
      "TNLQENKAVGTIG 0\n",
      "TAEKEKKAVGSFG 0\n",
      "SI--YMKHYCSRY 0\n",
      "TSEKEKKGVGTIG 0\n",
      "ND-THDSRVGSPV 0\n",
      "FFVEKSKEMGTMD 0\n",
      "KLPLSHKARVTLW 0\n",
      "KRPVKPKEVVTLW 0\n",
      "YNVRDHKESGSPN 0\n",
      "EGALDNKTYVTRW 0\n",
      "KKRYYKKESGSPC 0\n",
      "EP-HYTKDLGTAR 0\n",
      "QK-KT-KHRASRY 0\n",
      "EEKCHMKEFGTEE 0\n",
      "TK-LPGRARGAKK 1\n",
      "LT-FERARSDTPR 1\n",
      "TK-IPEARGLPVK 1\n",
      "MR-ESMARSMPVR 1\n",
      "IK-LPDARRGKIR 1\n",
      "NK-LGRARGLPVK 1\n",
      "IN-LSEARALPVK 1\n",
      "IK-LPHARRGKIR 1\n",
      "NK-LGRARGLPVK 1\n",
      "KK-LPDARRGKIR 1\n",
      "NK-VPNARGLPVK 1\n",
      "EQDENHARGSPIF 1\n",
      "TK-EERKPPRSLG 1\n",
      "TK-LPGRARGAKK 1\n",
      "TK-LPGRARGAKK 1\n",
      "IKIFPQARKVPFP 1\n",
      "QN-RPPARTLPVR 1\n",
      "TK-LPGRARGAKK 1\n",
      "LK-LPRARHGTKK 1\n",
      "VKAIPQARRVPFA 1\n",
      "LK-LPDARRGKIR 1\n",
      "QKILKPARAMPVK 1\n",
      "TS-LGQARALPVK 1\n",
      "DD-FENCKLERIP 1\n",
      "AK-VPEARGLPVK 1\n",
      "TK-LPGRARGAKK 1\n",
      "NR-IR-SKIGTPG 1\n",
      "VK-LPDARGIPVR 1\n",
      "QN-RP-ARTLPIR 1\n",
      "IN-LSEARALPVK 1\n"
     ]
    }
   ],
   "execution_count": 36
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "One may notice from above the following correspondence between Figures 2 and 3\n",
    "and conservation patterns observed above. Firstly, position 125 clearly\n",
    "corresponds to position 168 on Fig. 3, with conserved K in STk and A/R in Tk.\n",
    "Furthermore, position 127 corresponds to 170 on Fig. 3, with conserved R in Tk.\n",
    "Moving on, position 162 corresponds to 203 in Fig. 2, with conserved K and R in\n",
    "Tk.\n",
    "\n",
    "Thus, feature selection managed to retrieve features whose conservation patterns\n",
    "differed between STk and Tk. Moreover, selected feature pertain to positions\n",
    "implicated as critical for the STk/Tk functional differences by a research\n",
    "paper; thus, clearly relevant."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.8"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}