Senior Data Engineer, R&D

BenchSci

Category: Health
Location: Toronto, Remote in North America
Job Function: Data
Seniority: Senior

We are currently seeking a Senior Data Engineer, R&D to join our Machine Learning team. Reporting to the Engineering Manager, you will build the data infrastructure that supports BenchSci’s supervised learning pipeline for the R&D problems we are solving. In this role, you will work closely with domain experts and ML engineers in the earliest stages of building a new feature: domain modelling, data preparation, feature engineering, and rapid prototyping of heuristics or baseline models.

Success will be measured by the continuous improvement of our model quality through a data-centric approach to model training, as well as the velocity with which we can ship new R&D features. 

You will:

– Collaborate closely with ML engineers and domain experts to solve interesting and challenging problems in extracting ground-truth data for training high-quality models

– Employ best practices in modern machine learning workflows within a cloud-based environment

– Set the baseline performance to beat by rapidly developing specialized heuristics or baseline models

– Analyze and evaluate our data sets across the ML lifecycle to ensure they are fit for purpose for both labeling and model training

– Work on projects involving some of the largest pharmaceutical companies in the world

– Troubleshoot, analyze, and resolve issues in a timely manner

– Have opportunities to work both independently and in pair-programming settings 

– Be given an unmatched opportunity for growth and development, and to learn from a team of outstanding engineers

You have:

– 4+ years of experience working as a professional developer

– Expertise in Python and programming fundamentals

– Expertise in intermediate-to-advanced SQL and in BigQuery or a similar serverless data warehousing solution

– Experience with statistical analysis of datasets

– Experience with cloud reference architectures for common patterns in data pipelines

– Strong cross-team communication and collaboration skills

Nice to have, but not mandatory:

– A background in life sciences

– Working knowledge of data versioning tools such as DVC for machine learning

– Working knowledge of distributed systems and data processing fundamentals

– Knowledge of distributed data processing abstractions like Beam or Spark

– Working knowledge of machine learning data fundamentals such as data splits, training-serving skew, common data representations (for example, embeddings or multi-hot encodings), and sampling strategies for active learning

– Working knowledge of how to evaluate classification model quality using metrics such as precision, recall, F1, and PR/ROC curves
