CORD19 - Round 1 Response by UW and NLPCORE.ipynb 75.3 KB
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Project Description\n",
    "This collaborative project is put together by students of TCSS 592 at the [School of Engineering and Technology, University of Washington Tacoma](https://www.tacoma.uw.edu/set/school-engineering-technology-home) and [NLPCORE](https://nlpcore.com) a Seattle, WA startup using NLPCORE's search engine to extract meaningful phrases (concepts) grouped together in named categories (topics) along with their specific linkages / relationships (joint references) in the literature. These topics could be dictionary terms such as Proteins, Cell Lines or user specified such as (Host Cells, Viruses) or dynamically extracted by the search engine based upon search terms.\n",
    "\n",
    "The objective of the project is to provide most relevant and specific references (not just articles but specific sentences with-in each article) along with relevant biologial materials as a response for the questions posed in this challenge. Our goal is to enable life scienes researchers to quickly gather, triage and identify most applicable subset of candidate protiens and/or reagents for their experiments related to Covid-19 research.\n",
    "\n",
    "## Background\n",
    "Varun Mittal - cofounder at NLPCORE, is a University of Washington alumni with Masters CS degree in AI and ML techniques and has remained as faculty support for Prof. Dr. Ka Yee Yeung at UW, who is conducting its TCSS592 class this spring. This CORD19 challenge has provided a unique opportunity for both UW TCSS592 class students and NLPORE team to work together under guidance of Dr. Ka Yee.\n",
    "\n",
    "NLPCORE is a knowledge discovery platform powered by its unique AI and ML techniques (US Patents:  [#10102274](https://patents.google.com/patent/US10102274B2) & [#10372739](https://patents.google.com/patent/US20190005049A1)) that delivers contextual and actionable results for users across various verticals – life sciences, case law, patents, insurance and more.\n",
    "\n",
    "![Identify Entities and Relationships using Part of Speech tags](https://i.imgur.com/dXT19EW.png)\n",
    "\n",
    "Its search technology collects statistics such as word frequencies, offsets as well as part of speech tags (e.g. noun, pronoun, or verb) in its index. Words that appear most frequently and closest to the search keyword(s), provide seed articles for its neural-net algorithms that also factor in heuristics, dictionaries and in-place user-feedback. For any given search keyword(s), its search engine scans across all matching articles deploying a (Hadoop like) cluster of processing nodes to identify and retrieve the most appropriate concepts, their grouping into meaningful topics, their relationships to each other and their specific annotated references from the entire text corpus.\n",
    "\n",
    "In this project submission, we have used both the dataset provided as well as the open-access subset from NIH (pubmed central) to focus on all coronavirus related research and extract related content from both existing and newly available research.\n",
    "\n",
    "## Extracting, Analysing and Presenting Results\n",
    "In order to respond to challenge questions, we submitted a number of search keywords to NLPCORE along with suggested topics to extract based upon students' research and suggestions. We collected these results ie concepts, topics and their joint references into dataframes. We then experimented with a number of concept/link attributes such as frequency of terms or coocurrences, distance of these terms from searched keywords, their part of speech tags (mostly pronounts, nouns or verbs), topics they belong to, etc. as way to filter the dataframes to the most meaningful subset. We then present the output along with individual text references (Article Ref, Title, Section Title, Surrounding Sentences) in recommended table format that can be readily exported as a CSV file and consumed by the researchers for their further analysis and experimentation.\n",
    "\n",
    "## Future Plans\n",
    "We plan to improve both the quality and presentation of our initial submission in Round 2 of this challenge and enable one-click access to a search portal where users can interact with results in various formats such as document, list or graph views, filter them at will for any combination of attributes from topics, concepts  and/or articles and jump to specific article with color-coded highlights (where color represents a topic category).\n",
    "\n",
    "Beyond this Kaggle challenge, we also wish to engage with life sciences researchers directly, help them apply results from our technology (that remains available to research community at no cost) for their experiments and identify areas of further improvements in our toolset.\n",
    "\n",
    "## Acknowledgements\n",
    "We at NLPCORE acknowledge Prof. Dr. Ka Yee Yeung and her class of TCSS592 in particular Abigail Jerger and Emma Briggs who contributed immensely to research for and prepare this submission."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Tasks Attempted\n",
    "For round 1, we have attempted to respond to following CORD-19 challenges.\n",
    "\n",
    "* Task 1: What is known about transmission, incubation, and environmental stability?\n",
    "* Task 2: What do we know about COVID-19 risk factors?\n",
    "* Task 3: What do we know about virus genetics, origin, and evolution?\n",
    "* Task 4: What do we know about vaccines and therapeutics?\n",
    "\n",
    "For each task, we took the key phrases (mostly unique words) from the description of the task itself and forced our search engine to search the neighborhood of these words together with mention of coronavirus itself in the literature and extracted the most frequent concepts, biomaterials (proteins and cells) and their combined references in articles. These references should in most cases approximate the response to the challenge posed. We may have noisy results in our first round submission but will attempt to improve upon the same in our subsequent submission."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Requirement already satisfied: jupyter_datatables in /opt/conda/lib/python3.6/site-packages\n",
      "Requirement already satisfied: ipykernel in /opt/conda/lib/python3.6/site-packages (from jupyter_datatables)\n",
      "Requirement already satisfied: pandas>=0.23.0 in /opt/conda/lib/python3.6/site-packages (from jupyter_datatables)\n",
      "Requirement already satisfied: scipy in /opt/conda/lib/python3.6/site-packages (from jupyter_datatables)\n",
      "Requirement already satisfied: ipython in /opt/conda/lib/python3.6/site-packages (from jupyter_datatables)\n",
      "Requirement already satisfied: jupyter-require>=0.3.0 in /opt/conda/lib/python3.6/site-packages (from jupyter_datatables)\n",
      "Requirement already satisfied: jupyter-client in /opt/conda/lib/python3.6/site-packages (from ipykernel->jupyter_datatables)\n",
      "Requirement already satisfied: traitlets>=4.1.0 in /opt/conda/lib/python3.6/site-packages (from ipykernel->jupyter_datatables)\n",
      "Requirement already satisfied: tornado>=4.0 in /opt/conda/lib/python3.6/site-packages (from ipykernel->jupyter_datatables)\n",
      "Requirement already satisfied: numpy>=1.13.3 in /opt/conda/lib/python3.6/site-packages (from pandas>=0.23.0->jupyter_datatables)\n",
      "Requirement already satisfied: pytz>=2017.2 in /opt/conda/lib/python3.6/site-packages (from pandas>=0.23.0->jupyter_datatables)\n",
      "Requirement already satisfied: python-dateutil>=2.6.1 in /opt/conda/lib/python3.6/site-packages (from pandas>=0.23.0->jupyter_datatables)\n",
      "Requirement already satisfied: pexpect; sys_platform != \"win32\" in /opt/conda/lib/python3.6/site-packages (from ipython->jupyter_datatables)\n",
      "Requirement already satisfied: setuptools>=18.5 in /opt/conda/lib/python3.6/site-packages (from ipython->jupyter_datatables)\n",
      "Requirement already satisfied: prompt-toolkit<2.0.0,>=1.0.15 in /opt/conda/lib/python3.6/site-packages (from ipython->jupyter_datatables)\n",
      "Requirement already satisfied: backcall in /opt/conda/lib/python3.6/site-packages (from ipython->jupyter_datatables)\n",
      "Requirement already satisfied: decorator in /opt/conda/lib/python3.6/site-packages (from ipython->jupyter_datatables)\n",
      "Requirement already satisfied: simplegeneric>0.8 in /opt/conda/lib/python3.6/site-packages (from ipython->jupyter_datatables)\n",
      "Requirement already satisfied: pygments in /opt/conda/lib/python3.6/site-packages (from ipython->jupyter_datatables)\n",
      "Requirement already satisfied: jedi>=0.10 in /opt/conda/lib/python3.6/site-packages (from ipython->jupyter_datatables)\n",
      "Requirement already satisfied: pickleshare in /opt/conda/lib/python3.6/site-packages (from ipython->jupyter_datatables)\n",
      "Requirement already satisfied: jupyter-nbutils in /opt/conda/lib/python3.6/site-packages (from jupyter-require>=0.3.0->jupyter_datatables)\n",
      "Requirement already satisfied: jupyter-contrib-nbextensions in /opt/conda/lib/python3.6/site-packages (from jupyter-require>=0.3.0->jupyter_datatables)\n",
      "Requirement already satisfied: daiquiri in /opt/conda/lib/python3.6/site-packages (from jupyter-require>=0.3.0->jupyter_datatables)\n",
      "Requirement already satisfied: csscompressor in /opt/conda/lib/python3.6/site-packages (from jupyter-require>=0.3.0->jupyter_datatables)\n",
      "Requirement already satisfied: jupyter_core in /opt/conda/lib/python3.6/site-packages (from jupyter-client->ipykernel->jupyter_datatables)\n",
      "Requirement already satisfied: pyzmq>=13 in /opt/conda/lib/python3.6/site-packages (from jupyter-client->ipykernel->jupyter_datatables)\n",
      "Requirement already satisfied: six in /opt/conda/lib/python3.6/site-packages (from traitlets>=4.1.0->ipykernel->jupyter_datatables)\n",
      "Requirement already satisfied: ipython-genutils in /opt/conda/lib/python3.6/site-packages (from traitlets>=4.1.0->ipykernel->jupyter_datatables)\n",
      "Requirement already satisfied: ptyprocess>=0.5 in /opt/conda/lib/python3.6/site-packages (from pexpect; sys_platform != \"win32\"->ipython->jupyter_datatables)\n",
      "Requirement already satisfied: wcwidth in /opt/conda/lib/python3.6/site-packages (from prompt-toolkit<2.0.0,>=1.0.15->ipython->jupyter_datatables)\n",
      "Requirement already satisfied: parso>=0.3.0 in /opt/conda/lib/python3.6/site-packages (from jedi>=0.10->ipython->jupyter_datatables)\n",
      "Requirement already satisfied: notebook>=4.0 in /opt/conda/lib/python3.6/site-packages (from jupyter-contrib-nbextensions->jupyter-require>=0.3.0->jupyter_datatables)\n",
      "Requirement already satisfied: lxml in /opt/conda/lib/python3.6/site-packages (from jupyter-contrib-nbextensions->jupyter-require>=0.3.0->jupyter_datatables)\n",
      "Requirement already satisfied: jupyter-contrib-core>=0.3.3 in /opt/conda/lib/python3.6/site-packages (from jupyter-contrib-nbextensions->jupyter-require>=0.3.0->jupyter_datatables)\n",
      "Requirement already satisfied: jupyter-highlight-selected-word>=0.1.1 in /opt/conda/lib/python3.6/site-packages (from jupyter-contrib-nbextensions->jupyter-require>=0.3.0->jupyter_datatables)\n",
      "Requirement already satisfied: jupyter-nbextensions-configurator>=0.4.0 in /opt/conda/lib/python3.6/site-packages (from jupyter-contrib-nbextensions->jupyter-require>=0.3.0->jupyter_datatables)\n",
      "Requirement already satisfied: pyyaml in /opt/conda/lib/python3.6/site-packages (from jupyter-contrib-nbextensions->jupyter-require>=0.3.0->jupyter_datatables)\n",
      "Requirement already satisfied: jupyter-latex-envs>=1.3.8 in /opt/conda/lib/python3.6/site-packages (from jupyter-contrib-nbextensions->jupyter-require>=0.3.0->jupyter_datatables)\n",
      "Requirement already satisfied: nbconvert>=4.2 in /opt/conda/lib/python3.6/site-packages (from jupyter-contrib-nbextensions->jupyter-require>=0.3.0->jupyter_datatables)\n",
      "Requirement already satisfied: python-json-logger in /opt/conda/lib/python3.6/site-packages (from daiquiri->jupyter-require>=0.3.0->jupyter_datatables)\n",
      "Requirement already satisfied: jinja2 in /opt/conda/lib/python3.6/site-packages (from notebook>=4.0->jupyter-contrib-nbextensions->jupyter-require>=0.3.0->jupyter_datatables)\n",
      "Requirement already satisfied: nbformat in /opt/conda/lib/python3.6/site-packages (from notebook>=4.0->jupyter-contrib-nbextensions->jupyter-require>=0.3.0->jupyter_datatables)\n",
      "Requirement already satisfied: Send2Trash in /opt/conda/lib/python3.6/site-packages (from notebook>=4.0->jupyter-contrib-nbextensions->jupyter-require>=0.3.0->jupyter_datatables)\n",
      "Requirement already satisfied: terminado>=0.8.1 in /opt/conda/lib/python3.6/site-packages (from notebook>=4.0->jupyter-contrib-nbextensions->jupyter-require>=0.3.0->jupyter_datatables)\n",
      "Requirement already satisfied: prometheus_client in /opt/conda/lib/python3.6/site-packages (from notebook>=4.0->jupyter-contrib-nbextensions->jupyter-require>=0.3.0->jupyter_datatables)\n",
      "Requirement already satisfied: bleach in /opt/conda/lib/python3.6/site-packages (from nbconvert>=4.2->jupyter-contrib-nbextensions->jupyter-require>=0.3.0->jupyter_datatables)\n",
      "Requirement already satisfied: pandocfilters>=1.4.1 in /opt/conda/lib/python3.6/site-packages (from nbconvert>=4.2->jupyter-contrib-nbextensions->jupyter-require>=0.3.0->jupyter_datatables)\n",
      "Requirement already satisfied: entrypoints>=0.2.2 in /opt/conda/lib/python3.6/site-packages (from nbconvert>=4.2->jupyter-contrib-nbextensions->jupyter-require>=0.3.0->jupyter_datatables)\n",
      "Requirement already satisfied: testpath in /opt/conda/lib/python3.6/site-packages (from nbconvert>=4.2->jupyter-contrib-nbextensions->jupyter-require>=0.3.0->jupyter_datatables)\n",
      "Requirement already satisfied: mistune>=0.7.4 in /opt/conda/lib/python3.6/site-packages (from nbconvert>=4.2->jupyter-contrib-nbextensions->jupyter-require>=0.3.0->jupyter_datatables)\n",
      "Requirement already satisfied: MarkupSafe>=0.23 in /opt/conda/lib/python3.6/site-packages (from jinja2->notebook>=4.0->jupyter-contrib-nbextensions->jupyter-require>=0.3.0->jupyter_datatables)\n",
      "Requirement already satisfied: jsonschema!=2.5.0,>=2.4 in /opt/conda/lib/python3.6/site-packages (from nbformat->notebook>=4.0->jupyter-contrib-nbextensions->jupyter-require>=0.3.0->jupyter_datatables)\n",
      "Requirement already satisfied: html5lib!=1.0b1,!=1.0b2,!=1.0b3,!=1.0b4,!=1.0b5,!=1.0b6,!=1.0b7,!=1.0b8,>=0.99999999pre in /opt/conda/lib/python3.6/site-packages (from bleach->nbconvert>=4.2->jupyter-contrib-nbextensions->jupyter-require>=0.3.0->jupyter_datatables)\n",
      "Requirement already satisfied: webencodings in /opt/conda/lib/python3.6/site-packages (from html5lib!=1.0b1,!=1.0b2,!=1.0b3,!=1.0b4,!=1.0b5,!=1.0b6,!=1.0b7,!=1.0b8,>=0.99999999pre->bleach->nbconvert>=4.2->jupyter-contrib-nbextensions->jupyter-require>=0.3.0->jupyter_datatables)\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[33mYou are using pip version 9.0.3, however version 20.0.2 is available.\r\n",
      "You should consider upgrading via the 'pip install --upgrade pip' command.\u001b[0m\r\n"
     ]
    },
    {
     "ename": "CommError",
     "evalue": "Comms haven't been initialized properly.. HINT: Try reloading <F5> the window.",
     "output_type": "error",
     "traceback": [
      "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
      "\u001b[0;31mCommError\u001b[0m                                 Traceback (most recent call last)",
      "\u001b[0;32m<ipython-input-15-44956555a9d4>\u001b[0m in \u001b[0;36m<module>\u001b[0;34m()\u001b[0m\n\u001b[1;32m      6\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m      7\u001b[0m \u001b[0;32mfrom\u001b[0m \u001b[0mjupyter_datatables\u001b[0m \u001b[0;32mimport\u001b[0m \u001b[0minit_datatables_mode\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 8\u001b[0;31m \u001b[0minit_datatables_mode\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
      "\u001b[0;32m/opt/conda/lib/python3.6/site-packages/jupyter_datatables/__init__.py\u001b[0m in \u001b[0;36minit_datatables_mode\u001b[0;34m(options, classes)\u001b[0m\n\u001b[1;32m     94\u001b[0m     \u001b[0mextensions\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mconfig\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mdefaults\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mextensions\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m     95\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 96\u001b[0;31m     \u001b[0mrequire\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m\"d3\"\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m\"https://d3js.org/d3.v5.min\"\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m     97\u001b[0m     \u001b[0mrequire\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m\"d3-array\"\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m\"https://d3js.org/d3-array.v2.min\"\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m     98\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n",
      "\u001b[0;32m/opt/conda/lib/python3.6/site-packages/jupyter_require/core.py\u001b[0m in \u001b[0;36m__call__\u001b[0;34m(self, library, path, *args, **kwargs)\u001b[0m\n\u001b[1;32m    110\u001b[0m         \u001b[0;34m:\u001b[0m\u001b[0mparam\u001b[0m \u001b[0mpath\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0mstr\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mpath\u001b[0m \u001b[0;34m(\u001b[0m\u001b[0murl\u001b[0m\u001b[0;34m)\u001b[0m \u001b[0mto\u001b[0m \u001b[0mthe\u001b[0m \u001b[0mlibrary\u001b[0m \u001b[0mwithout\u001b[0m \u001b[0;34m.\u001b[0m\u001b[0mjs\u001b[0m \u001b[0msuffix\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    111\u001b[0m         \"\"\"\n\u001b[0;32m--> 112\u001b[0;31m         \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mconfig\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m{\u001b[0m\u001b[0mlibrary\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0mpath\u001b[0m\u001b[0;34m}\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mshim\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mkwargs\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mpop\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m'shim'\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m{\u001b[0m\u001b[0;34m}\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m    113\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    114\u001b[0m     \u001b[0;34m@\u001b[0m\u001b[0mproperty\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
      "\u001b[0;32m/opt/conda/lib/python3.6/site-packages/jupyter_require/core.py\u001b[0m in \u001b[0;36mconfig\u001b[0;34m(self, paths, shim)\u001b[0m\n\u001b[1;32m    199\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    200\u001b[0m         \u001b[0;32mif\u001b[0m \u001b[0;32mnot\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mis_initialized\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 201\u001b[0;31m             \u001b[0;32mraise\u001b[0m \u001b[0mCommError\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m\"Comms haven't been initialized properly.\"\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m    202\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    203\u001b[0m         \u001b[0mRequireJS\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m__LIBS\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mupdate\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mpaths\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
      "\u001b[0;31mCommError\u001b[0m: Comms haven't been initialized properly.. HINT: Try reloading <F5> the window."
     ]
    }
   ],
   "source": [
    "# ************************************** ENVIRONMENT SETUP ***********************************************\n",
    "# Setup Python environment\n",
    "\n",
    "# install jupyter datatables for better presentation of dataframes\n",
    "!pip install jupyter_datatables\n",
    "\n",
    "from jupyter_datatables import init_datatables_mode\n",
    "init_datatables_mode()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np # linear algebra\n",
    "import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)\n",
    "import requests # process http requests\n",
    "from time import sleep # timer functions\n",
    "import json # process json objects\n",
    "from tqdm import tqdm # progress bar\n",
    "from hashlib import md5 # md5 hash for caching"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [],
   "source": [
    "# ************************************** FUNCTION DEFINITIONS ***********************************************\n",
    "\"\"\"\n",
    "User defined Topics that forces search engine to look in their neighborhood also\n",
    "\"\"\"\n",
    "select_topics = set(['ACTIVITY', 'ADE', 'AGENT', 'ANIMAL', 'ANIMALS', 'ANTAGONIST', 'ANTIVIRAL', 'ASYMPTOMATIC',\n",
    "                     'BAT', 'BINDING', 'BUFFER', 'CELL', 'CELLS', 'CIRCULATION', 'CLARITHROMYCIN', 'CO-INFECTIONS',\n",
    "                     'CO-MORBIDITIES', 'DISEASE', 'DRUG', 'DRUGS', 'ENVIRONMENT', 'ENZYME', 'ENZYMES', 'EXPERIMENTAL',\n",
    "                     'FARMERS', 'GENOME', 'HIGH-RISK', 'HISTONE', 'HOST', 'HYDROPHILIC', 'HYDROPHOBIC', 'INFECTION',\n",
    "                     'INTERACTIONS','IMMUNE', 'LIGAND', 'LIVESTOCK', 'MINOCYCLINE', 'MODEL', 'NAGOYA', 'NAPROXEN',\n",
    "                     'NEONATES', 'NUCLEOTIDE', 'PATIENT', 'PATHOGENESIS', 'PEPTIDE', 'PEPTIDES', 'PHENOTYPE', 'PLATES',\n",
    "                     'POLYPROTEIN', 'PPE', 'PRE-EXISTING', 'PREGNANCY', 'PROTEIN', 'PROTOCOL', 'PROPHYLAXIS',\n",
    "                     'PULMONARY', 'RBD', 'RANGE', 'REAGENT', 'REAGENTS', 'RECEPTER', 'REPLICATION', 'RESIDUES', 'RESPONSE'\n",
    "                     'SEQUENCING', 'SHEDDING', 'SMOKING', 'STRAIN', 'STRUCTURES', 'THERAPEUTIC', 'TRACKING',\n",
    "                     'TRANSCRIBE', 'TRANSCRIPTASE', 'TRANSMISSION', 'TREATMENT', 'VACCINE', 'VIRAL', 'VIRUS',\n",
    "                     'WILDLIFE', 'UNIVERSAL'])\n",
    "\n",
    "\"\"\"\n",
    "Extract concepts and topics and their relationship from the search engine including user-defined topics\n",
    "\"\"\"\n",
    "def get_graph(project_name=\"cord19-dataset\", source=\"coronavirus\", target=\"transmission\", auth=\"test-key\"):\n",
    "\n",
    "    params = {\n",
    "        'auth': auth,\n",
    "        'u_name': source,\n",
    "        \"v_name\": target,\n",
    "        \"return_dataframe\": True,\n",
    "        \"additional_topics\": \",\".join(map(lambda word: word.lower(), select_topics)),\n",
    "        \"draw\": 3,\n",
    "        \"project_name\": project_name\n",
    "    }\n",
    "    r = requests.get(\"https://apis.nlpcore.com/apis/get_graph/\", params=params)\n",
    "    if r.status_code != 200:\n",
    "        raise RuntimeError(\"Failed to get_graph, please try again., %s\" % r.content)\n",
    "    dataframe = pd.DataFrame(json.loads(r.content))\n",
    "    return dataframe\n",
    "\n",
    "\"\"\"\n",
    "Filter rows in a dataframe to specific topics\n",
    "\"\"\"\n",
    "def subset_dataframe(dataframe, given_topics):\n",
    "\n",
    "    select_dataframe_rows = []\n",
    "    for _,row in dataframe.iterrows():\n",
    "        source_topics = set(row['source_topics'])\n",
    "        target_topics = set(row['target_topics'])\n",
    "        if (given_topics & source_topics or given_topics & target_topics):\n",
    "            select_dataframe_rows.append(row)\n",
    "    return pd.DataFrame(select_dataframe_rows)\n",
    "\n",
    "\"\"\"\n",
    "Get Article metadata/attributes for a given document id, store results in caches for repeated calls\n",
    "\"\"\"\n",
    "def document_metadata(project_name, document_id, auth=\"test-key\"):\n",
    "    \n",
    "    cache_key_str = \"%s-%s\" % (project_name, document_id)\n",
    "    cache_key = md5(cache_key_str.encode()).hexdigest()\n",
    "    cache_path = \"/tmp/metadata2-%s.json\" % cache_key\n",
    "\n",
    "    try:\n",
    "        return json.load(open(cache_path))\n",
    "    except FileNotFoundError:\n",
    "        pass\n",
    "\n",
    "    r = requests.get(\"https://apis.nlpcore.com/apis/get_document_metadata/\", params={'project_name': project_name,\n",
    "                                                                            'auth': auth,\n",
    "                                                                            'd': document_id})\n",
    "    if r.status_code == 200:\n",
    "        reference_data = r.json()\n",
    "        json.dump(reference_data, open(cache_path, \"w\"))\n",
    "\n",
    "    return r.json()\n",
    "\n",
    "\"\"\"\n",
    "Dataframe returned from the above calls has a list of concepts and their references. For each reference we can request \n",
    "text segments. The parameter \"r\" is a comma seperated list of integers which are senetence numbers.\n",
    "Cache references for repeated calls.\n",
    "\"\"\"\n",
    "def get_references(project_name, document_id, r, auth=\"test-key\"):\n",
    "    \n",
    "    cache_key_str = \"%s-%s-%s\" % (project_name, r, document_id)\n",
    "    cache_key = md5(cache_key_str.encode()).hexdigest()\n",
    "    cache_path = \"/tmp/%s.json\" % cache_key\n",
    "    \n",
    "    try:\n",
    "        return json.load(open(cache_path))\n",
    "    except FileNotFoundError:\n",
    "        pass\n",
    "    \n",
    "    r = requests.get(\"https://apis.nlpcore.com/apis/get_references/\", params={'project_name': project_name,\n",
    "                                                                             'auth': auth, 'r': r,\n",
    "                                                                             'd': document_id})\n",
    "    if r.status_code == 200:\n",
    "        reference_data = r.json()\n",
    "        json.dump(reference_data, open(cache_path, \"w\"))\n",
    "    \n",
    "    return r.json()\n",
    "\n",
    "\"\"\"\n",
    "Augment the dataframe with article and sentence references for each of the co-occuring concepts in each row\n",
    "\"\"\"\n",
    "def refine_dataframe(project_name, dataframe, auth=\"test-key\"):\n",
    "\n",
    "    select_dataframe_rows = []\n",
    "    for _,row in tqdm(list(dataframe.iterrows())):\n",
    "        source_topics = set(row['source_topics'])\n",
    "        target_topics = set(row['target_topics'])\n",
    "        if (select_topics & source_topics and select_topics & target_topics) and row['source_idf'] < 3 and row['target_idf'] < 3:            \n",
    "            reference_texts = []\n",
    "            for document_id,references in row['references'].items():\n",
    "                title = document_metadata(project_name=project_name, document_id=document_id)['title'] or \"<No Title>\"\n",
    "                sections = {}\n",
    "                for reference in references[:1]:\n",
    "                    r = \"%d,%d\" % (reference['curr_s'], reference['curr_t'])\n",
    "                    text = get_references(project_name=project_name, document_id=document_id, r=r)\n",
    "                    for section in text.values():\n",
    "                        try:\n",
    "                            section_title = section['section_title']\n",
    "                        except Exception as e:\n",
    "                            print(section)\n",
    "                            raise e\n",
    "                        try:\n",
    "                            bucket = sections[section_title]\n",
    "                        except KeyError:\n",
    "                            bucket = []\n",
    "                            sections[section_title] = bucket                        \n",
    "                        bucket.append(section['sentence'])\n",
    "                reference_texts.append({'title': title, 'sections': sections})                    \n",
    "            select_dataframe_rows.append({'source': row['u_name'], 'target': row['v_name'], 'source_types': \", \".join(source_topics),\n",
    "                                         'target_types': \", \".join(target_topics), 'count': row['count'],\n",
    "                                         'references': reference_texts})\n",
    "    return pd.DataFrame(select_dataframe_rows)\n",
    "\n",
    "\"\"\"\n",
    "Augment the dataframe with select sentences that match keywords from task\n",
    "\"\"\"\n",
    "def search_task_words(dataframe, given_topics):\n",
    "    \n",
    "    select_dataframe_rows = []\n",
    "    given_topics = [word.lower() for word in given_topics] \n",
    "    for _,row in dataframe.iterrows():\n",
    "        matched_sentences = []\n",
    "        matched_words = []\n",
    "        for reference_obj in row['references']:\n",
    "            for section_title, sentences in reference_obj['sections'].items():\n",
    "                for sentence in sentences:\n",
    "                    matched = [word for word in given_topics if word in sentence.lower()]\n",
    "                    if matched:\n",
    "                        matched_sentences.append(sentence)\n",
    "        row['sentences'] = matched_sentences\n",
    "        select_dataframe_rows.append(row.to_dict())\n",
    "    return pd.DataFrame(select_dataframe_rows)        \n",
    "\n",
    "# ************************************** END OF FUNCTION DEFINITIONS ***********************************************"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Task 1: What is known about transmission, incubation, and environmental stability?\n",
    "From the description of the task, we identified the following key phrases as our primary search topics:\n",
    "\n",
    "* Transmission\n",
    "* Incubation\n",
    "* Asymptomatic shedding\n",
    "* Hydrophilic surface\n",
    "* Hydrophobic surface\n",
    "* Virus shedding\n",
    "* Disease model\n",
    "* Animal model\n",
    "* Phenotype change\n",
    "* PPE effectiveness\n",
    "\n",
    "The following code blocks compute the dataframe that holds the most applicable result set for these topics."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "metadata": {},
   "outputs": [],
   "source": [
    "\"\"\"\n",
    "Filter the results to focus only on the most relevant topics for this challenge\n",
    "\"\"\"\n",
    "task_topics = set(['ANIMAL', 'ANIMALS', 'ASYMPTOMATIC', 'MODEL', 'TRANSMISSION', 'INCUBATION',\n",
    "                   'SHEDDING', 'HYDROPHILIC', 'HYDROPHOBIC', 'VIRUS', 'DISEASE', 'PHENOTYPE',\n",
    "                   'PPE'])\n",
    "project_name=\"cord19-dataset\"\n",
    "source=\"coronavirus\"\n",
    "target=\"transmission\"\n",
    "auth=\"test-key\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "metadata": {
    "require": [
     "base/js/events",
     "datatables.net",
     "d3",
     "chartjs",
     "dt-config",
     "dt-components",
     "dt-graph-objects",
     "dt-toolbar",
     "dt-tooltips",
     "jupyter-datatables"
    ]
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "100%|██████████| 1416/1416 [00:00<00:00, 1975.70it/s]\n"
     ]
    }
   ],
   "source": [
    "\"\"\"\n",
    "Get the initial dataframe, filter it down to the topics of interest, and add article references\n",
    "\"\"\"\n",
    "df = get_graph(project_name, source, target, auth)\n",
    "task_df = refine_dataframe(project_name, subset_dataframe(df, task_topics), auth)\n",
    "task1_df = search_task_words(task_df, task_topics)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>source</th>\n",
       "      <th>target</th>\n",
       "      <th>source_types</th>\n",
       "      <th>target_types</th>\n",
       "      <th>count</th>\n",
       "      <th>references</th>\n",
       "      <th>sentences</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Table</td>\n",
       "      <td>Bat SARSr Coronavirus Rf1</td>\n",
       "      <td>BAT</td>\n",
       "      <td>VIRUS</td>\n",
       "      <td>1</td>\n",
       "      <td>[{'title': 'SARS-Coronavirus ancestor's foot-p...</td>\n",
       "      <td>[The BLAST 78% identity value indicated that t...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>S. pneumoniae infections</td>\n",
       "      <td>influenza H1N1</td>\n",
       "      <td>INFECTION</td>\n",
       "      <td>VIRUS</td>\n",
       "      <td>1</td>\n",
       "      <td>[{'title': 'Applications of Molecular Tools to...</td>\n",
       "      <td>[This founding virus, an influenza A H1N1, rem...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>influenza H1N1</td>\n",
       "      <td>H1N1 virus</td>\n",
       "      <td>VIRUS</td>\n",
       "      <td>VIRUS</td>\n",
       "      <td>1</td>\n",
       "      <td>[{'title': 'Applications of Molecular Tools to...</td>\n",
       "      <td>[This founding virus, an influenza A H1N1, rem...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>H1N1 virus</td>\n",
       "      <td>H3N2 virus</td>\n",
       "      <td>VIRUS</td>\n",
       "      <td>VIRUS</td>\n",
       "      <td>1</td>\n",
       "      <td>[{'title': 'Interspecies transmission and emer...</td>\n",
       "      <td>[This is especially common for pigs, in which ...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>H1N1 virus</td>\n",
       "      <td>H5N1 influenza viruses</td>\n",
       "      <td>VIRUS</td>\n",
       "      <td>VIRUS</td>\n",
       "      <td>1</td>\n",
       "      <td>[{'title': 'Interspecies transmission and emer...</td>\n",
       "      <td>[Improve polymerase activity and RNA replicati...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>563</th>\n",
       "      <td>Southern China</td>\n",
       "      <td>H9N2 influenza virus lineages</td>\n",
       "      <td>VIRUS</td>\n",
       "      <td>VIRUS</td>\n",
       "      <td>1</td>\n",
       "      <td>[{'title': 'Emerging viral infections in a rap...</td>\n",
       "      <td>[On the basis of recent studies in Southern Ch...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>564</th>\n",
       "      <td>H9N2 influenza virus lineages</td>\n",
       "      <td>H9N2 viruses</td>\n",
       "      <td>VIRUS</td>\n",
       "      <td>VIRUS</td>\n",
       "      <td>1</td>\n",
       "      <td>[{'title': 'Emerging viral infections in a rap...</td>\n",
       "      <td>[On the basis of recent studies in Southern Ch...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>565</th>\n",
       "      <td>MERS-CoV infection rate</td>\n",
       "      <td>MERS-CoV transmission</td>\n",
       "      <td>INFECTION</td>\n",
       "      <td>TRANSMISSION</td>\n",
       "      <td>1</td>\n",
       "      <td>[{'title': 'King Abdulaziz Medical City, Minis...</td>\n",
       "      <td>[Recent evidence has clearly shown that MERS-C...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>566</th>\n",
       "      <td>vaccinia virus Ankara</td>\n",
       "      <td>Chimpanzee Adenovirus</td>\n",
       "      <td>VIRUS</td>\n",
       "      <td>VIRUS</td>\n",
       "      <td>1</td>\n",
       "      <td>[{'title': 'King Abdulaziz Medical City, Minis...</td>\n",
       "      <td>[However 4 Viral Vector Vaccines, two MVA (Mod...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>567</th>\n",
       "      <td>Chimpanzee Adenovirus</td>\n",
       "      <td>MERS-CoV virus</td>\n",
       "      <td>VIRUS</td>\n",
       "      <td>VIRUS</td>\n",
       "      <td>1</td>\n",
       "      <td>[{'title': 'King Abdulaziz Medical City, Minis...</td>\n",
       "      <td>[However 4 Viral Vector Vaccines, two MVA (Mod...</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>568 rows × 7 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "                            source                         target  \\\n",
       "0                            Table      Bat SARSr Coronavirus Rf1   \n",
       "1         S. pneumoniae infections                 influenza H1N1   \n",
       "2                   influenza H1N1                     H1N1 virus   \n",
       "3                       H1N1 virus                     H3N2 virus   \n",
       "4                       H1N1 virus         H5N1 influenza viruses   \n",
       "..                             ...                            ...   \n",
       "563                 Southern China  H9N2 influenza virus lineages   \n",
       "564  H9N2 influenza virus lineages                   H9N2 viruses   \n",
       "565        MERS-CoV infection rate          MERS-CoV transmission   \n",
       "566          vaccinia virus Ankara          Chimpanzee Adenovirus   \n",
       "567          Chimpanzee Adenovirus                 MERS-CoV virus   \n",
       "\n",
       "    source_types  target_types  count  \\\n",
       "0            BAT         VIRUS      1   \n",
       "1      INFECTION         VIRUS      1   \n",
       "2          VIRUS         VIRUS      1   \n",
       "3          VIRUS         VIRUS      1   \n",
       "4          VIRUS         VIRUS      1   \n",
       "..           ...           ...    ...   \n",
       "563        VIRUS         VIRUS      1   \n",
       "564        VIRUS         VIRUS      1   \n",
       "565    INFECTION  TRANSMISSION      1   \n",
       "566        VIRUS         VIRUS      1   \n",
       "567        VIRUS         VIRUS      1   \n",
       "\n",
       "                                            references  \\\n",
       "0    [{'title': 'SARS-Coronavirus ancestor's foot-p...   \n",
       "1    [{'title': 'Applications of Molecular Tools to...   \n",
       "2    [{'title': 'Applications of Molecular Tools to...   \n",
       "3    [{'title': 'Interspecies transmission and emer...   \n",
       "4    [{'title': 'Interspecies transmission and emer...   \n",
       "..                                                 ...   \n",
       "563  [{'title': 'Emerging viral infections in a rap...   \n",
       "564  [{'title': 'Emerging viral infections in a rap...   \n",
       "565  [{'title': 'King Abdulaziz Medical City, Minis...   \n",
       "566  [{'title': 'King Abdulaziz Medical City, Minis...   \n",
       "567  [{'title': 'King Abdulaziz Medical City, Minis...   \n",
       "\n",
       "                                             sentences  \n",
       "0    [The BLAST 78% identity value indicated that t...  \n",
       "1    [This founding virus, an influenza A H1N1, rem...  \n",
       "2    [This founding virus, an influenza A H1N1, rem...  \n",
       "3    [This is especially common for pigs, in which ...  \n",
       "4    [Improve polymerase activity and RNA replicati...  \n",
       "..                                                 ...  \n",
       "563  [On the basis of recent studies in Southern Ch...  \n",
       "564  [On the basis of recent studies in Southern Ch...  \n",
       "565  [Recent evidence has clearly shown that MERS-C...  \n",
       "566  [However 4 Viral Vector Vaccines, two MVA (Mod...  \n",
       "567  [However 4 Viral Vector Vaccines, two MVA (Mod...  \n",
       "\n",
       "[568 rows x 7 columns]"
      ]
     },
     "execution_count": 32,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Print results\n",
    "task1_df"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Task 2: What do we know about COVID-19 risk factors?\n",
    "From the description of the task, we identified the following key phrases as our primary search topics:\n",
    "\n",
    "* Smoking\n",
    "* Pre-existing pulmonary disease\n",
    "* Co-infections\n",
    "* Co-morbidities\n",
    "* Neonates\n",
    "* Pregnancy\n",
    "* High-risk patient group\n",
    "\n",
    "The following code blocks compute the dataframe that holds the most applicable result set for these topics."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {
    "require": [
     "base/js/events",
     "datatables.net",
     "d3",
     "chartjs",
     "dt-config",
     "dt-components",
     "dt-graph-objects",
     "dt-toolbar",
     "dt-tooltips",
     "jupyter-datatables"
    ]
   },
   "outputs": [],
   "source": [
    "\"\"\"\n",
    "Filter the results to focus only on the most relevant topics for this challenge\n",
    "\"\"\"\n",
    "task_topics = set(['SMOKING', 'PRE-EXISTING', 'PULMONARY', 'DISEASE', 'CO-INFECTIONS', 'CO-MORBIDITIES', 'NEONATES', 'PREGNANCY', \n",
    "                   'HIGH-RISK', 'PATIENT'])\n",
    "project_name=\"cord19-dataset\"\n",
    "source=\"coronavirus\"\n",
    "target=\"disease\"\n",
    "auth=\"test-key\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "100%|██████████| 561/561 [00:00<00:00, 2046.69it/s]\n"
     ]
    }
   ],
   "source": [
    "\"\"\"\n",
    "Get the initial dataframe, filter it down to the topics of interest, and add article references\n",
    "\"\"\"\n",
    "df = get_graph(project_name, source, target, auth)\n",
    "task_df = refine_dataframe(project_name, subset_dataframe(df, task_topics), auth)\n",
    "task2_df = search_task_words(task_df, task_topics)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>source</th>\n",
       "      <th>target</th>\n",
       "      <th>source_types</th>\n",
       "      <th>target_types</th>\n",
       "      <th>count</th>\n",
       "      <th>references</th>\n",
       "      <th>sentences</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>SARS coronavirus</td>\n",
       "      <td>co-morbidities</td>\n",
       "      <td>VIRUS</td>\n",
       "      <td>CO-MORBIDITIES</td>\n",
       "      <td>1</td>\n",
       "      <td>[{'title': '&lt;No Title&gt;', 'sections': {'Case fa...</td>\n",
       "      <td>[This bias was reduced as the epidemic progres...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>FRhK-4 cells</td>\n",
       "      <td>SARS patient</td>\n",
       "      <td>CELLS</td>\n",
       "      <td>PATIENT</td>\n",
       "      <td>1</td>\n",
       "      <td>[{'title': '&lt;No Title&gt;', 'sections': {'Genetic...</td>\n",
       "      <td>[Based on the first fulllength genome sequence...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>FRhK-4 cells</td>\n",
       "      <td>SARS</td>\n",
       "      <td>CELLS</td>\n",
       "      <td>DISEASE</td>\n",
       "      <td>1</td>\n",
       "      <td>[{'title': '&lt;No Title&gt;', 'sections': {'Cause':...</td>\n",
       "      <td>[Thus, SARS, a new emerging infectious disease...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>SARS patient</td>\n",
       "      <td>Sequence comparison-all virus isolates</td>\n",
       "      <td>PATIENT</td>\n",
       "      <td>VIRUS</td>\n",
       "      <td>1</td>\n",
       "      <td>[{'title': '&lt;No Title&gt;', 'sections': {'Genetic...</td>\n",
       "      <td>[Based on the first fulllength genome sequence...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>SARS patient</td>\n",
       "      <td>cross-reactivity</td>\n",
       "      <td>PATIENT</td>\n",
       "      <td>ACTIVITY</td>\n",
       "      <td>1</td>\n",
       "      <td>[{'title': 'The aetiology, origins, and diagno...</td>\n",
       "      <td>[76 A recent study showed that SARS coronaviru...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>223</th>\n",
       "      <td>BVD</td>\n",
       "      <td>coronavirus</td>\n",
       "      <td>DISEASE</td>\n",
       "      <td>VIRUS</td>\n",
       "      <td>1</td>\n",
       "      <td>[{'title': 'CORONAVIRUS INFECTION OF THE BOVIN...</td>\n",
       "      <td>[A chicken antiserum to bovine enteric coronav...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>224</th>\n",
       "      <td>SARSCoV2</td>\n",
       "      <td>disease</td>\n",
       "      <td>VIRUS</td>\n",
       "      <td>DISEASE</td>\n",
       "      <td>1</td>\n",
       "      <td>[{'title': 'Articles Radiological findings fro...</td>\n",
       "      <td>[1 On Jan 7, 2020, a novel coronavirus, severe...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>225</th>\n",
       "      <td>COVID19</td>\n",
       "      <td>Wuhan Jinyintan Hospital</td>\n",
       "      <td>DISEASE</td>\n",
       "      <td>INFECTION</td>\n",
       "      <td>1</td>\n",
       "      <td>[{'title': 'Articles Radiological findings fro...</td>\n",
       "      <td>[3 Most of the initial cases of coronavirus di...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>226</th>\n",
       "      <td>equineinfectiousdiseases.com</td>\n",
       "      <td>NSAIDs</td>\n",
       "      <td>DISEASE</td>\n",
       "      <td>DRUGS</td>\n",
       "      <td>1</td>\n",
       "      <td>[{'title': '&lt;No Title&gt;', 'sections': {'Clinica...</td>\n",
       "      <td>[equineinfectiousdiseases.com.]</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>227</th>\n",
       "      <td>Imulan</td>\n",
       "      <td>FIP</td>\n",
       "      <td>CELL</td>\n",
       "      <td>DISEASE</td>\n",
       "      <td>1</td>\n",
       "      <td>[{'title': 'An update on feline infectious per...</td>\n",
       "      <td>[1 Subsequent trials with PI that excluded cat...</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>228 rows × 7 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "                           source                                  target  \\\n",
       "0                SARS coronavirus                          co-morbidities   \n",
       "1                    FRhK-4 cells                            SARS patient   \n",
       "2                    FRhK-4 cells                                    SARS   \n",
       "3                    SARS patient  Sequence comparison-all virus isolates   \n",
       "4                    SARS patient                        cross-reactivity   \n",
       "..                            ...                                     ...   \n",
       "223                           BVD                             coronavirus   \n",
       "224                      SARSCoV2                                 disease   \n",
       "225                       COVID19                Wuhan Jinyintan Hospital   \n",
       "226  equineinfectiousdiseases.com                                  NSAIDs   \n",
       "227                        Imulan                                     FIP   \n",
       "\n",
       "    source_types    target_types  count  \\\n",
       "0          VIRUS  CO-MORBIDITIES      1   \n",
       "1          CELLS         PATIENT      1   \n",
       "2          CELLS         DISEASE      1   \n",
       "3        PATIENT           VIRUS      1   \n",
       "4        PATIENT        ACTIVITY      1   \n",
       "..           ...             ...    ...   \n",
       "223      DISEASE           VIRUS      1   \n",
       "224        VIRUS         DISEASE      1   \n",
       "225      DISEASE       INFECTION      1   \n",
       "226      DISEASE           DRUGS      1   \n",
       "227         CELL         DISEASE      1   \n",
       "\n",
       "                                            references  \\\n",
       "0    [{'title': '<No Title>', 'sections': {'Case fa...   \n",
       "1    [{'title': '<No Title>', 'sections': {'Genetic...   \n",
       "2    [{'title': '<No Title>', 'sections': {'Cause':...   \n",
       "3    [{'title': '<No Title>', 'sections': {'Genetic...   \n",
       "4    [{'title': 'The aetiology, origins, and diagno...   \n",
       "..                                                 ...   \n",
       "223  [{'title': 'CORONAVIRUS INFECTION OF THE BOVIN...   \n",
       "224  [{'title': 'Articles Radiological findings fro...   \n",
       "225  [{'title': 'Articles Radiological findings fro...   \n",
       "226  [{'title': '<No Title>', 'sections': {'Clinica...   \n",
       "227  [{'title': 'An update on feline infectious per...   \n",
       "\n",
       "                                             sentences  \n",
       "0    [This bias was reduced as the epidemic progres...  \n",
       "1    [Based on the first fulllength genome sequence...  \n",
       "2    [Thus, SARS, a new emerging infectious disease...  \n",
       "3    [Based on the first fulllength genome sequence...  \n",
       "4    [76 A recent study showed that SARS coronaviru...  \n",
       "..                                                 ...  \n",
       "223  [A chicken antiserum to bovine enteric coronav...  \n",
       "224  [1 On Jan 7, 2020, a novel coronavirus, severe...  \n",
       "225  [3 Most of the initial cases of coronavirus di...  \n",
       "226                    [equineinfectiousdiseases.com.]  \n",
       "227  [1 Subsequent trials with PI that excluded cat...  \n",
       "\n",
       "[228 rows x 7 columns]"
      ]
     },
     "execution_count": 23,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Print results\n",
    "task2_df"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Task 3: What do we know about virus genetics, origin, and evolution?\n",
    "From the description of the task, we identified the following key phrases as our primary search topics:\n",
    "\n",
    "* Genome tracking\n",
    "* Strain circulation\n",
    "* Nagoya Protocol\n",
    "* Livestock\n",
    "* Receptor binding\n",
    "* Farmers\n",
    "* Wildlife\n",
    "* Host range\n",
    "* Experimental infection\n",
    "* Animal host\n",
    "\n",
    "The following code blocks compute the dataframe that holds the most applicable result set for these topics."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {},
   "outputs": [],
   "source": [
    "\"\"\"\n",
    "Filter the results to focus only on the most relevant topics for this challenge\n",
    "\"\"\"\n",
    "task_topics = set(['GENOME', 'TRACKING', 'STRAIN', 'CIRCULATION', 'NAGOYA', 'LIVESTOCK', 'RECEPTOR', 'BINDING',\n",
    "                   'FARMERS', 'WILDLIFE', 'HOST', 'RANGE', 'EXPERIMENTAL', 'INFECTION', 'ANIMAL', 'PROTOCOL'])\n",
    "project_name=\"cord19-dataset\"\n",
    "source=\"coronavirus\"\n",
    "target=\"strain\"\n",
    "auth=\"test-key\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "100%|██████████| 5630/5630 [00:03<00:00, 1694.81it/s]\n"
     ]
    }
   ],
   "source": [
    "\"\"\"\n",
    "Get the initial dataframe, filter it down to the topics of interest, and add article references\n",
    "\"\"\"\n",
    "df = get_graph(project_name, source, target, auth)\n",
    "task_df = refine_dataframe(project_name, subset_dataframe(df, task_topics), auth)\n",
    "task3_df = search_task_words(task_df, task_topics)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>source</th>\n",
       "      <th>target</th>\n",
       "      <th>source_types</th>\n",
       "      <th>target_types</th>\n",
       "      <th>count</th>\n",
       "      <th>references</th>\n",
       "      <th>sentences</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>RealArt HPA coronavirus RT-PCR</td>\n",
       "      <td>Tor2 strain</td>\n",
       "      <td>VIRUS</td>\n",
       "      <td>STRAIN</td>\n",
       "      <td>1</td>\n",
       "      <td>[{'title': 'Comprehensive detection and identi...</td>\n",
       "      <td>[Column three shows the average (on two experi...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>OC43 strains</td>\n",
       "      <td>virus stocks</td>\n",
       "      <td>STRAIN</td>\n",
       "      <td>STRAIN</td>\n",
       "      <td>1</td>\n",
       "      <td>[{'title': 'Comprehensive detection and identi...</td>\n",
       "      <td>[Human coronaviruses OC43 and 229E strains wer...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>OC43 strains</td>\n",
       "      <td>FIPV</td>\n",
       "      <td>STRAIN</td>\n",
       "      <td>VIRUS</td>\n",
       "      <td>3</td>\n",
       "      <td>[{'title': 'Viral Diseases*', 'sections': {'Di...</td>\n",
       "      <td>[In a complement fixation test, antigens in in...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>OC43 strains</td>\n",
       "      <td>OC43 viruses</td>\n",
       "      <td>STRAIN</td>\n",
       "      <td>VIRUS</td>\n",
       "      <td>1</td>\n",
       "      <td>[{'title': 'Epidemiology, Genetic Recombinatio...</td>\n",
       "      <td>[A study in France, between 2001 and 2013, dem...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>virus stocks</td>\n",
       "      <td>IBV</td>\n",
       "      <td>STRAIN</td>\n",
       "      <td>VIRUS</td>\n",
       "      <td>1</td>\n",
       "      <td>[{'title': 'Comprehensive detection and identi...</td>\n",
       "      <td>[Tissue culture-adapted infectious bronchitis ...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2723</th>\n",
       "      <td>strain MHV-S</td>\n",
       "      <td>et Kienzle</td>\n",
       "      <td>STRAIN</td>\n",
       "      <td>VIRUS</td>\n",
       "      <td>1</td>\n",
       "      <td>[{'title': 'Coronavirus Pathogenesis', 'sectio...</td>\n",
       "      <td>[The genome of the tissue culture-adapted A59 ...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2724</th>\n",
       "      <td>MHV-S strain</td>\n",
       "      <td>esterase substrate</td>\n",
       "      <td>STRAIN</td>\n",
       "      <td>LIGAND</td>\n",
       "      <td>1</td>\n",
       "      <td>[{'title': 'Coronavirus Pathogenesis', 'sectio...</td>\n",
       "      <td>[A caveat to this result is that expression of...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2725</th>\n",
       "      <td>strain JHM.SD</td>\n",
       "      <td>17Cl-1 cells</td>\n",
       "      <td>STRAIN</td>\n",
       "      <td>CELLS</td>\n",
       "      <td>1</td>\n",
       "      <td>[{'title': 'Coronavirus Pathogenesis', 'sectio...</td>\n",
       "      <td>[In addition, N has been reported to associate...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2726</th>\n",
       "      <td>SARS-CoV nsp13-maltose binding protein</td>\n",
       "      <td>MBP</td>\n",
       "      <td>BINDING</td>\n",
       "      <td>PROTEIN</td>\n",
       "      <td>1</td>\n",
       "      <td>[{'title': 'Coronavirus Pathogenesis', 'sectio...</td>\n",
       "      <td>[A histidine-tagged form of the alphacoronavir...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2727</th>\n",
       "      <td>orf6a protein</td>\n",
       "      <td>SARS-CoV genome</td>\n",
       "      <td>PROTEIN</td>\n",
       "      <td>GENOME</td>\n",
       "      <td>1</td>\n",
       "      <td>[{'title': 'Coronavirus Pathogenesis', 'sectio...</td>\n",
       "      <td>[The orf6a protein was further demonstrated to...</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>2728 rows × 7 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "                                      source              target source_types  \\\n",
       "0             RealArt HPA coronavirus RT-PCR         Tor2 strain        VIRUS   \n",
       "1                               OC43 strains        virus stocks       STRAIN   \n",
       "2                               OC43 strains                FIPV       STRAIN   \n",
       "3                               OC43 strains        OC43 viruses       STRAIN   \n",
       "4                               virus stocks                 IBV       STRAIN   \n",
       "...                                      ...                 ...          ...   \n",
       "2723                            strain MHV-S          et Kienzle       STRAIN   \n",
       "2724                            MHV-S strain  esterase substrate       STRAIN   \n",
       "2725                           strain JHM.SD        17Cl-1 cells       STRAIN   \n",
       "2726  SARS-CoV nsp13-maltose binding protein                 MBP      BINDING   \n",
       "2727                           orf6a protein     SARS-CoV genome      PROTEIN   \n",
       "\n",
       "     target_types  count                                         references  \\\n",
       "0          STRAIN      1  [{'title': 'Comprehensive detection and identi...   \n",
       "1          STRAIN      1  [{'title': 'Comprehensive detection and identi...   \n",
       "2           VIRUS      3  [{'title': 'Viral Diseases*', 'sections': {'Di...   \n",
       "3           VIRUS      1  [{'title': 'Epidemiology, Genetic Recombinatio...   \n",
       "4           VIRUS      1  [{'title': 'Comprehensive detection and identi...   \n",
       "...           ...    ...                                                ...   \n",
       "2723        VIRUS      1  [{'title': 'Coronavirus Pathogenesis', 'sectio...   \n",
       "2724       LIGAND      1  [{'title': 'Coronavirus Pathogenesis', 'sectio...   \n",
       "2725        CELLS      1  [{'title': 'Coronavirus Pathogenesis', 'sectio...   \n",
       "2726      PROTEIN      1  [{'title': 'Coronavirus Pathogenesis', 'sectio...   \n",
       "2727       GENOME      1  [{'title': 'Coronavirus Pathogenesis', 'sectio...   \n",
       "\n",
       "                                              sentences  \n",
       "0     [Column three shows the average (on two experi...  \n",
       "1     [Human coronaviruses OC43 and 229E strains wer...  \n",
       "2     [In a complement fixation test, antigens in in...  \n",
       "3     [A study in France, between 2001 and 2013, dem...  \n",
       "4     [Tissue culture-adapted infectious bronchitis ...  \n",
       "...                                                 ...  \n",
       "2723  [The genome of the tissue culture-adapted A59 ...  \n",
       "2724  [A caveat to this result is that expression of...  \n",
       "2725  [In addition, N has been reported to associate...  \n",
       "2726  [A histidine-tagged form of the alphacoronavir...  \n",
       "2727  [The orf6a protein was further demonstrated to...  \n",
       "\n",
       "[2728 rows x 7 columns]"
      ]
     },
     "execution_count": 26,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Print results\n",
    "task3_df"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Task 4: What do we know about vaccines and therapeutics?\n",
    "From the description of the task, we identified the following key phrases as our primary search topics:\n",
    "\n",
    "* naproxen\n",
    "* clarithromycin\n",
    "* minocycline\n",
    "* Antibody Dependent Enhancement (ADE)\n",
    "* therapeutic\n",
    "* antiviral agent\n",
    "* universal vaccine\n",
    "* prophylaxis (preventative)\n",
    "* vaccine immune response\n",
    "\n",
    "The following code block computes the dataframe containing the most applicable result set for these topics."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {},
   "outputs": [],
   "source": [
    "\"\"\"\n",
    "Filter the results to focus only on the most relevant topics for this challenge\n",
    "\"\"\"\n",
    "task_topics = set(['NAPROXEN', 'CLARITHROMYCIN', 'MINOCYCLINE', 'ADE', 'THERAPEUTIC', 'ANTIVIRAL', 'AGENT', 'UNIVERSAL',\n",
    "                  'VACCINE', 'PROPHYLAXIS', 'IMMUNE', 'RESPONSE'])\n",
    "project_name=\"cord19-dataset\"\n",
    "source=\"coronavirus\"\n",
    "target=\"vaccine\"\n",
    "auth=\"test-key\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "100%|██████████| 305/305 [00:00<00:00, 1613.40it/s]\n"
     ]
    }
   ],
   "source": [
    "\"\"\"\n",
    "Get the initial dataframe and filter it down to topics of interest and add article references\n",
    "\"\"\"\n",
    "df = get_graph(project_name, source, target, auth)\n",
    "task_df = refine_dataframe(project_name, subset_dataframe(df, task_topics), auth)\n",
    "task4_df = search_task_words(task_df, task_topics)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>source</th>\n",
       "      <th>target</th>\n",
       "      <th>source_types</th>\n",
       "      <th>target_types</th>\n",
       "      <th>count</th>\n",
       "      <th>references</th>\n",
       "      <th>sentences</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>IBV</td>\n",
       "      <td>S1 glycoprotein amino acid sequence relatedness</td>\n",
       "      <td>AGENT</td>\n",
       "      <td>PROTEIN</td>\n",
       "      <td>1</td>\n",
       "      <td>[{'title': 'Molecular evolution and emergence ...</td>\n",
       "      <td>[The causative agent, IBV, has also been found...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>subtype</td>\n",
       "      <td>Wadey</td>\n",
       "      <td>STRAIN</td>\n",
       "      <td>ADE</td>\n",
       "      <td>1</td>\n",
       "      <td>[{'title': 'Full genome analysis of Australian...</td>\n",
       "      <td>[Four different Australian strains of IBV were...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Wadey</td>\n",
       "      <td>et Lougovskaia</td>\n",
       "      <td>ADE</td>\n",
       "      <td>VIRUS</td>\n",
       "      <td>1</td>\n",
       "      <td>[{'title': 'Full genome analysis of Australian...</td>\n",
       "      <td>[Four different Australian strains of IBV were...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>borreliosis</td>\n",
       "      <td>B. henselae</td>\n",
       "      <td>DISEASE</td>\n",
       "      <td>AGENT</td>\n",
       "      <td>1</td>\n",
       "      <td>[{'title': 'Current Clinical Applications of M...</td>\n",
       "      <td>[Other species have been found in isolated cas...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>SARS coronavirus</td>\n",
       "      <td>coronavirus OC43 shares</td>\n",
       "      <td>AGENT</td>\n",
       "      <td>VIRUS</td>\n",
       "      <td>1</td>\n",
       "      <td>[{'title': 'Respiratory Research Molecular mec...</td>\n",
       "      <td>[Parallel to the progress made in the epidemio...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>141</th>\n",
       "      <td>pathogen eradicated-variola virus</td>\n",
       "      <td>agent</td>\n",
       "      <td>AGENT</td>\n",
       "      <td>VIRUS</td>\n",
       "      <td>1</td>\n",
       "      <td>[{'title': 'Vaccines for Emerging Viral Diseas...</td>\n",
       "      <td>[Despite the significant impact of antimicrobi...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>142</th>\n",
       "      <td>super-spreaders</td>\n",
       "      <td>Ebolavirus</td>\n",
       "      <td>ADE</td>\n",
       "      <td>VIRUS</td>\n",
       "      <td>1</td>\n",
       "      <td>[{'title': 'Vaccines for Emerging Viral Diseas...</td>\n",
       "      <td>[It has a reproductive rate (R 0 ) of &lt;1 for p...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>143</th>\n",
       "      <td>E2 glycoproteins</td>\n",
       "      <td>ECSA clade</td>\n",
       "      <td>PROTEIN</td>\n",
       "      <td>ADE</td>\n",
       "      <td>1</td>\n",
       "      <td>[{'title': 'Vaccines for Emerging Viral Diseas...</td>\n",
       "      <td>[Neutralizing antibodies against an outbreak s...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>144</th>\n",
       "      <td>HKU5 coronaviruses</td>\n",
       "      <td>super-spreaders</td>\n",
       "      <td>VIRUS</td>\n",
       "      <td>ADE</td>\n",
       "      <td>1</td>\n",
       "      <td>[{'title': 'Vaccines for Emerging Viral Diseas...</td>\n",
       "      <td>[The outbreak in Seoul was primarily restricte...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>145</th>\n",
       "      <td>S-adenosyl-L-methionine</td>\n",
       "      <td>remainder</td>\n",
       "      <td>ADE</td>\n",
       "      <td>VIRUS</td>\n",
       "      <td>1</td>\n",
       "      <td>[{'title': 'Recombination in Avian Gamma-Coron...</td>\n",
       "      <td>[The nsp16 is reported to be an S-adenosyl-L-m...</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>146 rows × 7 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "                                source  \\\n",
       "0                                  IBV   \n",
       "1                              subtype   \n",
       "2                                Wadey   \n",
       "3                          borreliosis   \n",
       "4                     SARS coronavirus   \n",
       "..                                 ...   \n",
       "141  pathogen eradicated-variola virus   \n",
       "142                    super-spreaders   \n",
       "143                   E2 glycoproteins   \n",
       "144                 HKU5 coronaviruses   \n",
       "145            S-adenosyl-L-methionine   \n",
       "\n",
       "                                              target source_types  \\\n",
       "0    S1 glycoprotein amino acid sequence relatedness        AGENT   \n",
       "1                                              Wadey       STRAIN   \n",
       "2                                     et Lougovskaia          ADE   \n",
       "3                                        B. henselae      DISEASE   \n",
       "4                            coronavirus OC43 shares        AGENT   \n",
       "..                                               ...          ...   \n",
       "141                                            agent        AGENT   \n",
       "142                                       Ebolavirus          ADE   \n",
       "143                                       ECSA clade      PROTEIN   \n",
       "144                                  super-spreaders        VIRUS   \n",
       "145                                        remainder          ADE   \n",
       "\n",
       "    target_types  count                                         references  \\\n",
       "0        PROTEIN      1  [{'title': 'Molecular evolution and emergence ...   \n",
       "1            ADE      1  [{'title': 'Full genome analysis of Australian...   \n",
       "2          VIRUS      1  [{'title': 'Full genome analysis of Australian...   \n",
       "3          AGENT      1  [{'title': 'Current Clinical Applications of M...   \n",
       "4          VIRUS      1  [{'title': 'Respiratory Research Molecular mec...   \n",
       "..           ...    ...                                                ...   \n",
       "141        VIRUS      1  [{'title': 'Vaccines for Emerging Viral Diseas...   \n",
       "142        VIRUS      1  [{'title': 'Vaccines for Emerging Viral Diseas...   \n",
       "143          ADE      1  [{'title': 'Vaccines for Emerging Viral Diseas...   \n",
       "144          ADE      1  [{'title': 'Vaccines for Emerging Viral Diseas...   \n",
       "145        VIRUS      1  [{'title': 'Recombination in Avian Gamma-Coron...   \n",
       "\n",
       "                                             sentences  \n",
       "0    [The causative agent, IBV, has also been found...  \n",
       "1    [Four different Australian strains of IBV were...  \n",
       "2    [Four different Australian strains of IBV were...  \n",
       "3    [Other species have been found in isolated cas...  \n",
       "4    [Parallel to the progress made in the epidemio...  \n",
       "..                                                 ...  \n",
       "141  [Despite the significant impact of antimicrobi...  \n",
       "142  [It has a reproductive rate (R 0 ) of <1 for p...  \n",
       "143  [Neutralizing antibodies against an outbreak s...  \n",
       "144  [The outbreak in Seoul was primarily restricte...  \n",
       "145  [The nsp16 is reported to be an S-adenosyl-L-m...  \n",
       "\n",
       "[146 rows x 7 columns]"
      ]
     },
     "execution_count": 29,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Print results\n",
    "task4_df"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## END OF FILE"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.5"
  },
  "require": {
   "paths": {
    "buttons.colvis": "https://cdn.datatables.net/buttons/1.5.6/js/buttons.colVis.min",
    "buttons.flash": "https://cdn.datatables.net/buttons/1.5.6/js/buttons.flash.min",
    "buttons.html5": "https://cdn.datatables.net/buttons/1.5.6/js/buttons.html5.min",
    "buttons.print": "https://cdn.datatables.net/buttons/1.5.6/js/buttons.print.min",
    "chartjs": "https://cdnjs.cloudflare.com/ajax/libs/Chart.js/2.8.0/Chart",
    "d3": "https://d3js.org/d3.v5.min",
    "d3-array": "https://d3js.org/d3-array.v2.min",
    "datatables.net": "https://cdn.datatables.net/1.10.18/js/jquery.dataTables",
    "datatables.net-buttons": "https://cdn.datatables.net/buttons/1.5.6/js/dataTables.buttons.min",
    "datatables.responsive": "https://cdn.datatables.net/responsive/2.2.2/js/dataTables.responsive.min",
    "datatables.scroller": "https://cdn.datatables.net/scroller/2.0.0/js/dataTables.scroller.min",
    "datatables.select": "https://cdn.datatables.net/select/1.3.0/js/dataTables.select.min",
    "jszip": "https://cdnjs.cloudflare.com/ajax/libs/jszip/2.5.0/jszip.min",
    "moment": "https://cdnjs.cloudflare.com/ajax/libs/moment.js/2.8.0/moment",
    "pdfmake": "https://cdnjs.cloudflare.com/ajax/libs/pdfmake/0.1.36/pdfmake.min",
    "vfsfonts": "https://cdnjs.cloudflare.com/ajax/libs/pdfmake/0.1.36/vfs_fonts"
   },
   "shim": {
    "buttons.colvis": {
     "deps": [
      "jszip",
      "datatables.net-buttons"
     ]
    },
    "buttons.flash": {
     "deps": [
      "jszip",
      "datatables.net-buttons"
     ]
    },
    "buttons.html5": {
     "deps": [
      "jszip",
      "datatables.net-buttons"
     ]
    },
    "buttons.print": {
     "deps": [
      "jszip",
      "datatables.net-buttons"
     ]
    },
    "chartjs": {
     "deps": [
      "moment"
     ]
    },
    "datatables.net": {
     "exports": "$.fn.dataTable"
    },
    "datatables.net-buttons": {
     "deps": [
      "datatables.net"
     ]
    },
    "pdfmake": {
     "deps": [
      "datatables.net"
     ]
    },
    "vfsfonts": {
     "deps": [
      "datatables.net"
     ]
    }
   }
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}