General information
Organisation
The French Alternative Energies and Atomic Energy Commission (CEA) is a key player in research, development and innovation in four main areas:
• defence and security,
• nuclear energy (fission and fusion),
• technological research for industry,
• fundamental research in the physical sciences and life sciences.
Drawing on its widely acknowledged expertise, and thanks to its 16,000 technicians, engineers, researchers and staff, the CEA actively participates in collaborative projects with a large number of academic and industrial partners.
The CEA is established in ten centers spread throughout France.
Reference
2020-15229
Unit description
The French Atomic Energy Commission (CEA) is a major European player in research, development and innovation. This technological research organisation is active in four main areas: energy, information technologies, health and defence. Located on the Saclay campus in the south of the Île-de-France region, the Laboratoire d'Intégration des Systèmes et des Technologies (LIST) designs intelligent digital systems. Its 750 employees support 200 French and foreign companies every year on applied research projects in four areas: Advanced Manufacturing, Embedded Systems, Data Intelligence and Radiation Control for Health. The CEA LIST teams also partner with numerous university laboratories, grandes écoles and other research organisations through collaborative research projects. They are part of DIGITEO LABS, a scientific campus which brings together more than 1,200 researchers specialised in information technologies (CEA, INRIA, CNRS, Supélec, Ecole Polytechnique, Université d'Orsay). Within this institute, the SID (Data Intelligence Service) develops Artificial Intelligence solutions for decision support and for the user-oriented and automatic supervision of complex systems.
Position description
Category
Information system
Contract
Internship
Job title
Internship - Distributed data stream learning in a collaborative environment H/F
Subject
Data streams are defined as infinite streams of data integrated from both live and historical sources. In such scenarios, data stream processing algorithms must satisfy requirements such as bounded storage, single-pass processing (each data item is processed only once), real-time response and robustness to concept drift. Federated learning, on the other hand, is a technique that makes it possible to train an algorithm across multiple decentralized edge devices or servers holding local data samples, without exchanging those samples [4].
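Purely as an illustration of these two ideas (a hypothetical Python sketch, not part of STREAMER, with made-up names), the snippet below combines a single-pass update, where each mini-batch is processed once and then discarded, with a federated averaging round in which only model parameters, never raw data, are exchanged:

# Hypothetical sketch (plain NumPy): single-pass local updates followed by
# federated averaging -- only parameters cross node boundaries.
import numpy as np

def local_sgd_step(w, X, y, lr=0.1):
    """One gradient step on a mini-batch seen exactly once (linear model, squared loss)."""
    grad = X.T @ (X @ w - y) / len(y)
    return w - lr * grad          # the mini-batch is discarded afterwards (bounded storage)

def federated_average(local_weights):
    """Aggregate node models by averaging their parameters, never their data."""
    return np.mean(local_weights, axis=0)

rng = np.random.default_rng(0)
n_nodes, n_features = 3, 5
w_global = np.zeros(n_features)
true_w = np.arange(n_features, dtype=float)   # simulated ground truth shared by all streams

for communication_round in range(100):
    local_models = []
    for node in range(n_nodes):
        X = rng.normal(size=(32, n_features))            # mini-batch arriving at this node
        y = X @ true_w + rng.normal(scale=0.1, size=32)
        local_models.append(local_sgd_step(w_global.copy(), X, y))
    w_global = federated_average(local_models)           # server-side (or peer-to-peer) merge

print("learned weights:", np.round(w_global, 2))         # converges towards [0, 1, 2, 3, 4]

The same pattern applies to any model whose parameters can be averaged; in the internship, the local learner and the aggregation scheme would of course be replaced by the algorithm designed within STREAMER.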
Contract duration (months)
5 months
Job description
Nowadays the number of data stream applications and domains, where dynamism and speed truly matter, is increasing. In practice these streams represent dynamic data flows, coming from different sources, whose content evolves over time. Moreover, the complexity of current digital applications, and those of the near future, is constantly increasing due to a combination of aspects such as the large number of sources, the non-linearity of certain processes, the distribution of knowledge and control, response-time constraints, the strong dynamics of the environment, the unpredictability of interactions, etc. This complexity opens new research challenges about the generation and processing (learning) of these streams, especially in distributed, heterogeneous and collaborative environments, and calls for solutions capable of processing and learning from data in a decentralized way, such as federated learning. Furthermore, distributed learning brings several advantages, such as reducing computational complexity (since it is shared among different nodes), minimizing the load and cost of communications (parameters are transmitted instead of the full data), and increasing security (knowledge is shared among different machines).
Based on the previous motivations, the goal of this internship is to implement a decentralized federated algorithm applied to distributed streaming contexts. For this, the student will rely on STREAMER. Developed in the LI3A laboratory (CEA List), STREAMER is a cutting-edge data stream processing (Complex Event Processing) framework devoted to analyzing sequential data from electrical or industrial systems. For the completion of this internship, the student will work on three main tasks:
- Adapt STREAMER to work in a distributed fashion (inter-node communication needs to be implemented, kernel preparation, etc.).
- Propose and implement in STREAMER a distributed and decentralized algorithm (based on federated learning) that also deals with streaming challenges (incremental learning, no data storage, single pass, concept drift, etc.); a minimal illustrative sketch is given after this list.
- Start exploring ways of making such algorithms robust to unexpected difficulties such as noise in the data, sources that stop communicating, corrupted data, etc.
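As a hedged illustration of the second task (hypothetical Python, not the STREAMER API), the sketch below removes the central aggregator entirely: each node performs single-pass updates on its own stream and periodically averages its parameters with its ring neighbours only, while a constant learning rate keeps the model plastic enough to follow a simulated concept drift:

# Hypothetical sketch of a decentralized (serverless) federated learner on streams:
# single-pass SGD per node, then periodic gossip averaging with ring neighbours only.
import numpy as np

N_NODES, N_FEATURES, LR = 4, 3, 0.05
rng = np.random.default_rng(42)
weights = [np.zeros(N_FEATURES) for _ in range(N_NODES)]

def incremental_update(w, x, y):
    """Single-pass stochastic update on one sample; the sample is then discarded."""
    return w - LR * (x @ w - y) * x

for t in range(400):                                   # simulated unbounded stream
    true_w = np.array([1.0, -2.0, 0.5]) if t < 200 else np.array([-1.0, 2.0, 0.0])  # drift at t = 200
    for i in range(N_NODES):
        x = rng.normal(size=N_FEATURES)                # one sample arriving at node i
        y = x @ true_w + rng.normal(scale=0.1)
        weights[i] = incremental_update(weights[i], x, y)
    if t % 10 == 0:                                    # gossip round: exchange parameters, not data
        weights = [(weights[(i - 1) % N_NODES] + weights[i] + weights[(i + 1) % N_NODES]) / 3
                   for i in range(N_NODES)]

print("node 0 estimate after the drift:", np.round(weights[0], 2))  # tracks [-1, 2, 0]

Exchanging parameters only with direct neighbours keeps communication local and avoids a single point of failure, which also connects to the robustness questions of the third task.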
This internship may be extended into a 3-year PhD.
Keywords: Data Stream, Machine Learning, Federated Learning.
References:
[1] https://www.influxdata.com/, https://kafka.apache.org/, https://redis.io/
[2] Muthukrishnan, S. (2005). Data streams: Algorithms and applications. Foundations and Trends® in Theoretical Computer Science, 1(2), 117-236.
[3] Palpanas, T. (2015). Data series management: The road to big sequence analytics. ACM SIGMOD Record, 44(2), 47-52.
[4] Google AI blog: https://ai.googleblog.com/2017/04/federated-learning-collaborative.html
Methods / Means
Java, Python, InfluxDB, Kafka, Redis
Applicant Profile
We are looking for a candidate with:
• An engineering diploma or Master 2, preferably in computer science.
• Strong programming skills (Java, R, Python).
• Proficiency in spoken English.
• French and/or Spanish is a plus.
• Knowledge in machine learning is a plus.
• Familiarity with InfluxDB, Kafka or Redis [1] is a plus.
Position localisation
Site
Saclay
Job location
France, Ile-de-France, Essonne (91)
Location
Gif-sur-Yvette
Candidate criteria
Languages
English (Fluent)
Prepared diploma
Bac+5 - Master 2
Recommended training
Artificial Intelligence (Machine Learning)
PhD opportunity
Yes
Requester
Position start date
15/03/2021