Project Description:
a. Project Name: Information Science / Data Science – Chemical Formulas and Compounds Internship
b. Scope: Research how to extract genomic sequences and biochemical
information from source documents. The intern will work on the backend to build a parser
for chemical formulas and compounds whose output feeds into the system.
c. Project Description: Our AI performs mass data ingestion, and we would
like to explore new ways to identify genomic sequences in our data sources and
extract them without blurring or altering the source data.
i. An initial orientation on machine learning and genomics can be
found here: https://codete.com/blog/machine-learning-genomics/
ii. We typically extract data for our AI from scientific publications and
patent documents. These will be the target data sources.
iii. Simply running OCR over the documents (if they are PDFs) will destroy
the sequences or change their meaning/content. We also need to find a way to
persist the extracted data in a database/library/collection that our
algorithms can then query.
d. Form of Delivery: Periodic updates by email and a final report on your findings
in Word format.
Business Purpose: We need this information to improve the precision of our AI.
Duration: Part-time, minimum 7 hours per week
Work Hours: Flexible (intern can work at any time, including at night or on weekends)
Location: Remote (intern can live anywhere in the world)
Primary Work Premise: Home
Compensation: $15 (US Dollar) per hour
Consideration for full-time employment: Yes
Training Provided: Yes
Travel Required: No
Submission Requirements:
Please upload your resume (in English).
Interview format: A video interview via Zoom or Microsoft Teams will be arranged once
you have been notified of your selection.
Market Skills and Requirements
1. Requirement – Chemistry, Chemical Engineering or Biochemistry (major or minor)
AND
2. Requirement – Library and Information Sciences, Information Sciences, Information Studies, Information Systems, Bioinformatics, Computer Science or equivalent (major or minor)