Mining Software Artifacts for use in Automated Machine Learning
Speaker
José Pablo Cambronero
MIT-CSAIL
Host
Martin Rinard
MIT-CSAIL
Abstract:
Successfully implementing classical supervised machine learning pipelines
requires that users have software engineering, machine learning, and domain
experience. Machine learning libraries have helped along the first two
dimensions by providing modular implementations of popular algorithms. However,
implementing a pipeline remains an iterative, tedious, and data-dependent task
as users have to experiment with different pipeline designs.
To make the pipeline development process accessible to non-experts and more
efficient for experts, automated techniques can efficiently search for
high-performing pipelines with little user intervention. The techniques and
systems that automate this task are collectively termed automated machine
learning (AutoML).
Inspired by the success of software mining in areas such as code search, program
synthesis, and program repair, we investigate the hypothesis that information
mined from software artifacts can be used to build, improve interactions with,
and address missing use cases of AutoML. In particular, I will present three
systems -- AL, AMS, and Janus -- that make use of software artifacts. AL mines
dynamic execution traces of a collection of programs that implement machine
learning pipelines and uses these mined traces to learn to produce new
pipelines. AMS mines documentation and program examples to automatically
generate a search space for an AutoML tool by starting from a user-chosen set of
API components. Janus mines pipeline transformations from a collection of
machine learning pipelines; these transformations can be used to improve an
input pipeline while producing a nearby variant. Jointly, these systems and
their experimental results show that mining software artifacts can simplify
AutoML systems, make their customization easier, and extend them to novel use
cases.
Thesis Committee: Martin Rinard (advisor), Saman Amarasinghe and Armando Solar-Lezama
Link: https://mit.zoom.us/j/91205512750?pwd=Y1ZKRDd4WEtjNXN0a1BQajBiUERxdz09
For the password, contact José Pablo Cambronero, josepablocam@gmail.com, or Mary McDavitt, mmcdavit@csail.mit.edu