Natural language processing for genomics M/F

Job details

General information

Parent organization

The Commissariat à l'énergie atomique et aux énergies alternatives (CEA), the French Alternative Energies and Atomic Energy Commission, is a public research organization.

A major player in research, development and innovation, the CEA operates within the framework of its four missions:
- defense and security
- nuclear energy (fission and fusion)
- technological research for industry
- fundamental research (physical sciences and life sciences).

With its 16,000 employees (technicians, engineers, researchers, and research support staff), the CEA takes part in numerous collaborative projects alongside its academic and industrial partners.



Unit description

The successful candidate will join an interdisciplinary team (NLP and genomics) working on predictive and generative artificial intelligence for biology. The team exploits deep contextual language models of biological sequences, whose representations generalize to several applications, such as the prediction of mutational effects.

About the LASTI lab:

About Genoscope:

Position description


Mathematics, scientific information, software



Job title

Natural language processing for genomics M/F

Position status


Contract duration (in months)


Job description


Exponential growth in sequencing throughput, together with the sampling of natural (uncultured) populations, is providing a deeper view of the diversity of protein sequences across the tree of life. Proteins are molecular engines sustaining cellular life, and the unobserved determinants of their structure and function are encoded in the distribution of observed natural sequences. Such vast amounts of (unlabelled) sequences therefore provide evolutionary data that can form the basis for unsupervised learning of predictive and generative models of biological function.

Our focus here will be to train high-capacity Transformer-based language models on sequence data, in a way analogous to natural language understanding, where the semantics of words are determined from the contexts in which they appear in sentences. Intrinsic organizing principles captured in the resulting representations can then be applied in transfer learning settings to different prediction sub-tasks using limited experimental data, such as predicting the effect of sequence variation on function. Following promising recent results, we also plan to explore zero-shot inference, with no additional training or supervision from experimental data.
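As a purely illustrative sketch of the zero-shot idea above (not the team's actual method or code), a substitution can be scored by the log-odds of the mutant residue against the wild-type residue at the same position under a sequence model. Here a per-position frequency profile with pseudocounts stands in for the masked-position predictions of a Transformer language model. The function names, the toy sequences and the pseudocount value are all assumptions made for this example.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def fit_profile(aligned_seqs, pseudocount=1.0):
    """Estimate per-position residue probabilities from equal-length sequences.

    This simple profile is a stand-in (assumption) for the probabilities a
    trained language model would predict at a masked position.
    """
    length = len(aligned_seqs[0])
    total = len(aligned_seqs) + pseudocount * len(AMINO_ACIDS)
    profile = []
    for i in range(length):
        counts = Counter(seq[i] for seq in aligned_seqs)
        profile.append(
            {aa: (counts[aa] + pseudocount) / total for aa in AMINO_ACIDS}
        )
    return profile


def mutation_score(profile, wild_type, position, mutant):
    """Log-odds of the mutant vs. the wild-type residue (0-based position).

    No experimental labels are used: the score comes only from the model
    fitted on unlabelled sequences, which is the zero-shot setting.
    """
    probs = profile[position]
    return math.log(probs[mutant]) - math.log(probs[wild_type[position]])


# Hypothetical toy "family" of related sequences (not real proteins).
toy_family = ["MKTAY", "MKSAY", "MRTAY", "MKTAF"]
profile = fit_profile(toy_family)
wild_type = "MKTAY"

conservative = mutation_score(profile, wild_type, 2, "S")  # T->S, seen in family
disruptive = mutation_score(profile, wild_type, 2, "W")    # T->W, never observed
print(f"T3S: {conservative:.3f}  T3W: {disruptive:.3f}")
```

In this toy setting the substitution already observed in the family receives a higher (less negative) score than the one never observed. With a real language model, the per-position probabilities would instead depend on the full sequence context.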

This project will be an excellent opportunity for a candidate who is looking to contribute to cutting-edge research and to train with experts in the field. We are seeking a detail-oriented computer scientist and problem solver who is passionate about science.


* Tune and optimize existing unsupervised transformer-based language models for protein sequences.
* Develop and optimize code and machine learning algorithms for predictive models.
* Integrate and analyze large data volumes.
* Interact continuously with scientists in an interdisciplinary team.



This two-year position is open to a range of candidates, from recent college graduates to more experienced scientists (e.g. post-docs). The chosen candidate's salary will be commensurate with their level of education, skills, and experience. Other benefits include:
- 48 days of paid holidays
- on-site subsidized restaurant
- partial remote work is possible, up to 3 days per week and 100 days per year
- CEA contribution to the personal company savings plan

Candidate profile


* Ph.D. or M.Sc. in a quantitative discipline, e.g. Applied Mathematics, Computer Science, Computational Biology, Physics or a closely related discipline.
* Experience with Python, open-source software libraries for machine learning, and Linux (file systems, shell, hardware/software monitoring, etc.).
* Strong mathematical background and analytical skills.
* Effective organizational skills, e.g. the ability to prioritize work and contribute to the planning of a program of scientific research.
* Demonstrated interpersonal skills, including the ability both to work independently and to perform collaborative research in an interdisciplinary team environment.
* Good oral and written communication skills.

Preferred: previous experience with transformer-based techniques for NLP pre-training and with unsupervised transformer language models.

Job location

France, Ile-de-France, Essonne (91)



Candidate criteria


  • English (fluent)
  • French (basic)

Recommended education

Ph.D. or M.Sc. in a quantitative discipline, e.g. Applied Mathematics, Computer Science, Computation


Position availability