Experience
Senior ML Engineer
March 2024 - Present
- Develop LLMs for Cloud Code Completion in JetBrains AI.
- Lead the LLM training project, covering data processing, pre-training, fine-tuning, and alignment.
- Trained models that are deployed in JetBrains' top-selling IDEs, where they serve millions of users worldwide as a core feature of the JetBrains AI product.
Projects
Mellum: JetBrains LLM For Developers
October 2024
Contributed to Mellum, JetBrains' proprietary large language model built specifically for developers. The model is designed to understand code, program semantics, and development workflows in order to provide coding assistance.
Toloka LLM Leaderboard
July 2023
Developed the Toloka LLM Leaderboard, a comprehensive benchmarking tool for evaluating open large language models through human evaluations. This platform enables reliable comparison of model performance using crowdsourced assessments.
Crowd-Kit Python Library
May 2022
Created Crowd-Kit, an open-source Python library for crowdsourced data aggregation. The library implements methods for consolidating annotations in classification, regression, ranking, and pairwise comparison tasks, which improves data quality for ML applications.
CrowdSpeech Dataset
July 2021
Developed the CrowdSpeech dataset, a collection of crowdsourced audio transcriptions from non-professional workers. This resource provides valuable data for training and evaluating speech recognition systems and studying annotation quality control methods.
Publications
My research spans machine learning, crowdsourcing, and AI-generated content. You can find my complete publication history on Google Scholar.
Best Prompts for Text-to-Image Models and How to Find Them
SIGIR 2023
A novel approach for optimizing text prompts for text-to-image generation models using crowdsourcing techniques and evolutionary algorithms.
CrowdSpeech and Vox DIY: Benchmark Dataset for Crowdsourced Audio Transcription
NeurIPS Datasets and Benchmarks 2021
A benchmark dataset for evaluating crowdsourced audio transcription methods, featuring diverse languages and recording conditions.
Spherical convolutions on molecular graphs for protein model quality assessment
Machine Learning: Science and Technology 2021
A deep learning model operating on molecular graphs (S-GCN) for protein model quality prediction that achieved state-of-the-art results on the CASP MQA challenge.
Contact
Let's Connect
I'm always open to discussing new projects, opportunities, or partnerships. Feel free to reach out through any of these channels!
Need my resume?
Download for complete details on my experience and skills.