Tainã Coleman

Schmidt AI in Science Postdoctoral Fellow, San Diego Supercomputer Center
Ph.D., University of Southern California
M.S., California State University, Long Beach
B.S., Universidade Federal de Itajubá

Research Interests

My research interests are in Scientific Workflows. In particular, my goal is to develop a better understanding of how workflow structure affects execution in high-performance computing environments. To this end, I develop algorithms, benchmarks, and data-driven approaches for extracting and exploiting structure in scientific workflows.

Publications

Automated generation of scientific workflow generators with WfChef

Future Generation Computer Systems
Abstract

Scientific workflow applications have gained significant importance, and their automated and efficient execution on large-scale computing platforms has been the subject of extensive research and development. For these efforts to be successful, a solid experimental methodology is needed to evaluate workflow algorithms and systems. A foundation for this methodology is the availability of realistic workflow instances. Although public repositories provide workflow instances for a few scientific applications, these are limited in scope, and workflow instances are not available for all application scales of interest. To address this limitation, previous work has developed generators of synthetic workflow instances of arbitrary scales. Despite being popular, the implementation of these generators is a manual and labor-intensive process that requires expert application knowledge. As a result, these generators only target a handful of applications, even though there are hundreds of workflow applications in production. We introduce WfChef, a fully automated framework for constructing a synthetic workflow generator for any scientific application. Based on an input set of workflow instances for a particular application, WfChef automatically produces a synthetic workflow generator. To measure the realism of the generated workflows, we define and evaluate several metrics. Using these metrics, we compare the realism of the workflows generated by WfChef generators to that of the workflows generated by the previously available, hand-crafted generators. We find that WfChef generators not only require zero development effort (because they are automatically produced), but also generate workflows that are more realistic than those generated by hand-crafted generators.

Authors

Taina Coleman
Henri Casanova
Rafael Ferreira Da Silva

Workflows Community Summit 2022: A Roadmap Revolution

osti.gov
Abstract

Scientific workflows have become integral tools in broad scientific computing use cases. Science discovery is increasingly dependent on workflows to orchestrate large and complex scientific experiments that range from the execution of a cloud-based data preprocessing pipeline to multi-facility instrument-to-edge-to-HPC computational workflows. Given the changing landscape of scientific computing (often referred to as a computing continuum) and the evolving needs of emerging scientific applications, it is paramount that the development of novel scientific workflows and system functionalities seek to increase the efficiency, resilience, and pervasiveness of existing systems and applications. Specifically, the proliferation of machine learning/artificial intelligence (ML/AI) workflows, need for processing large-scale datasets produced by instruments at the edge, intensification of near real-time data processing, support for long-term experiment campaigns, and emergence of quantum computing as an adjunct to HPC, have significantly changed the functional and operational requirements of workflow systems. Workflow systems now need to, for example, support data streams from the edge-to-cloud-to-HPC, enable the management of many small-sized files [6], allow data reduction while ensuring high accuracy, orchestrate distributed services (workflows, instruments, data movement, provenance, publication, etc.) across computing and user facilities, among others. Further, to accelerate science, it is also necessary that these systems implement specifications/standards and APIs for seamless (horizontal and vertical) integration between systems and applications, as well as enable the publication of workflows and their associated products according to the FAIR principles.

Authors

Rafael Ferreira Da Silva
Rosa Badia
Venkat Bala
Deborah Bard
Peer-Timo Bremer
Ian Buckley
Silvina Caino-Lores
Kyle Chard
Carole Goble
Shantenu Jha
Daniel S Katz
Daniel Laney
Manish Parashar
Fred Suter
Nick Tyler
Thomas Uram
Ilkay Altintas
Stefan Andersson
William Arndt
Juan Aznar
Jonathan Bader
Bartosz Balis
Christopher Blanton
Kelly Braghetto
Aharon Brodutch
Paul Brunk
Henri Casanova
Alba Lierta
Justin Chigu
Taina Coleman
Nick Collier
Iacopo Colonnelli
Frederik Coppens
Michael Crusoe
Will Cunningham
Bruno Kinoshita
Paolo Di Tommaso
Charles Doutriaux
Matthew Downton
Wael Elwasif
Bjoern Enders
Christopher Erdmann
Thomas Fahringer
Ludmilla Figueiredo
Rosa Filgueira
Martin Foltin
Anne Fouilloux
Luiz Gadelha
Andy Gallo
Artur Garcia
Daniel Garijo
Roman Gerlach
Ryan E Grant
Samuel Grayson
Patricia Grubel
Johan Gustafsson
Valerie Hayot
Oscar Hernandez Mendoza
Marcus Hilbrich
Annmary Justine
Ian Laflotte
Fabian Lehmann
Andre Luckow
Jakob Luettgau
Ketan Maheshwari
Motohiko Matsuda
Doriana Medic
Pete Mendygral
Marek Michalewicz
Jorji Nonaka
Maciej Pawlik
Loic Pottier
Line Pouchard
Mathias Putz
Santosh Radha
Lavanya Ramakrishnan
Sashko Ristov
Paul Romano
Daniel Rosendo
Martin Ruefenacht
Katarzyna Rycerz
Nishant Saurabh
Volodymyr Savchenko
Martin Schulz
Christine Simpson
Raul Sirvent
Tyler Skluzacek
Stian Soiland-Reyes
Renan Santos Souza
Sreenivas R Sukumar
Ziheng Sun
Alan Sussman
Douglas Thain
Mikhail Titov
Benjamin Tovar
Aalap Tripathy
Matteo Turilli
Bartosz Tuznik
Hubertus van Dam
Aurelio Vivas
Logan Ward
Patrick Widener
Sean Wilkinson
Justyna Zawalska
Mahnoor Zulfiqar

Wfbench: Automated generation of scientific workflow benchmarks

2022 IEEE/ACM International Workshop on Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems (PMBS)
Abstract

The prevalence of scientific workflows with high computational demands calls for their execution on various distributed computing platforms, including large-scale leadership-class high-performance computing (HPC) clusters. To handle the deployment, monitoring, and optimization of workflow executions, many workflow systems have been developed over the past decade. There is a need for workflow benchmarks that can be used to evaluate the performance of workflow systems on current and future software stacks and hardware platforms. We present a generator of realistic workflow benchmark specifications that can be translated into benchmark code to be executed with current workflow systems. Our approach generates workflow tasks with arbitrary performance characteristics (CPU, memory, and I/O usage) and with realistic task dependency structures based on those seen in production workflows. We present experimental results that show that our approach generates benchmarks that are representative of production workflows, and conduct a case study to demonstrate the use and usefulness of our generated benchmarks to evaluate the performance of workflow systems under different configuration scenarios.

Authors

Taina Coleman
Henri Casanova
Ketan Maheshwari
Loïc Pottier
Sean R Wilkinson
Justin Wozniak
Frédéric Suter
Mallikarjun Shankar
Rafael Ferreira Da Silva

WfCommons: A Framework for Enabling Scientific Workflow Research and Development

Future Generation Computer Systems
Abstract

Scientific workflows are a cornerstone of modern scientific computing. They are used to describe complex computational applications that require efficient and robust management of large volumes of data, which are typically stored/processed on heterogeneous, distributed resources. The workflow research and development community has employed a number of methods for the quantitative evaluation of existing and novel workflow algorithms and systems. In particular, a common approach is to simulate workflow executions. In previous works, we have presented a collection of tools that have been adopted by the community for conducting workflow research. Despite their popularity, they suffer from several shortcomings that prevent easy adoption, maintenance, and consistency with the evolving structures and computational requirements of production workflows. In this work, we present WfCommons, a framework that provides a collection of tools for analyzing workflow executions, for producing generators of synthetic workflows, and for simulating workflow executions. We demonstrate the realism of the generated synthetic workflows by comparing their simulated executions to real workflow executions. We also contrast these results with results obtained when using the previously available collection of tools. We find that the workflow generators that are automatically constructed by our framework not only generate representative same-scale workflows (i.e., with structures and task characteristics distributions that resemble those observed in real-world workflows), but also do so at scales larger than that of available real-world workflows. Finally, we conduct a case study to demonstrate the usefulness of our framework for estimating the energy consumption of large-scale workflow executions.

A Community Roadmap for Scientific Workflows Research and Development

arXiv preprint
Abstract

The landscape of workflow systems for scientific applications is notoriously convoluted with hundreds of seemingly equivalent workflow systems, many isolated research claims, and a steep learning curve. To address some of these challenges and lay the groundwork for transforming workflows research and development, the WorkflowsRI and ExaWorks projects partnered to bring the international workflows community together. This paper reports on discussions and findings from two virtual 'Workflows Community Summits' (January and April, 2021). The overarching goals of these workshops were to develop a view of the state of the art, identify crucial research challenges in the workflows community, articulate a vision for potential community efforts, and discuss technical approaches for realizing this vision. To this end, participants identified six broad themes: FAIR computational workflows; AI workflows; exascale challenges; APIs, interoperability, reuse, and standards; training and education; and building a workflows community. We summarize discussions and recommendations for each of these themes.

Authors

Rafael Ferreira da Silva
Henri Casanova
Kyle Chard
Ilkay Altintas
Rosa M Badia
Bartosz Balis
Tainã Coleman
Frederik Coppens
Frank Di Natale
Bjoern Enders
Thomas Fahringer
Rosa Filgueira
Grigori Fursin
Daniel Garijo
Carole Goble
Dorran Howell
Shantenu Jha
Daniel S. Katz
Daniel Laney
Ulf Leser
Maciej Malawski
Kshitij Mehta
Loïc Pottier
Jonathan Ozik
J. Luc Peterson
Lavanya Ramakrishnan
Stian Soiland-Reyes
Douglas Thain
Matthew Wolf

Evaluating energy-aware scheduling algorithms for I/O-intensive scientific workflows

International Conference on Computational Science (ICCS)
Abstract

Improving energy efficiency has become necessary to enable sustainable computational science. At the same time, scientific workflows are key in facilitating distributed computing in virtually all domain sciences. As data and computational requirements increase, I/O-intensive workflows have become prevalent. In this work, we evaluate the ability of two popular energy-aware workflow scheduling algorithms to provide effective schedules for this class of workflow applications, that is, schedules that strike a good compromise between workflow execution time and energy consumption. These two algorithms make decisions based on a widely used power consumption model that simply assumes linear correlation to CPU usage. Previous work has shown this model to be inaccurate, in particular for modeling power consumption of I/O-intensive workflow executions, and has proposed an accurate model. We evaluate the effectiveness of the two aforementioned algorithms based on this accurate model. We find that, when making their decisions, these algorithms can underestimate power consumption by up to 360%, which makes it unclear how well these algorithms would fare in practice. To evaluate the benefit of using the more accurate power consumption model in practice, we propose a simple scheduling algorithm that relies on this model to balance the I/O load across the available compute resources. Experimental results show that this algorithm achieves more desirable compromises between energy consumption and workflow execution time than the two popular algorithms.
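
To illustrate the kind of CPU-linear power model the abstract refers to, here is a minimal Python sketch. The function names, the idle/peak wattages, and the exact form of the formula are assumptions made for illustration; they are not taken from the paper, and the paper's more accurate I/O-aware model is not reproduced here.

```python
def linear_power(cpu_util: float, p_idle: float, p_max: float) -> float:
    """Estimated power draw (watts) under a CPU-linear model.

    Assumed form: P(u) = p_idle + (p_max - p_idle) * u, with u in [0, 1].
    """
    if not 0.0 <= cpu_util <= 1.0:
        raise ValueError("cpu_util must be between 0 and 1")
    return p_idle + (p_max - p_idle) * cpu_util


def energy_joules(cpu_util: float, seconds: float,
                  p_idle: float, p_max: float) -> float:
    """Energy estimate: constant power over the task duration."""
    return linear_power(cpu_util, p_idle, p_max) * seconds


# Hypothetical node: 100 W idle, 200 W at full CPU load.
print(linear_power(0.5, 100.0, 200.0))         # 150.0
print(energy_joules(0.5, 10.0, 100.0, 200.0))  # 1500.0
```

Under such a model, a task that spends most of its time on I/O and keeps CPU utilization low is predicted to draw close to idle power, which is consistent with the abstract's observation that CPU-only models can underestimate the consumption of I/O-intensive workflows.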

Workflows Community Summit: Bringing the Scientific Workflows Community Together

Zenodo
Abstract

Scientific workflows have been used almost universally across scientific domains, and have underpinned some of the most significant discoveries of the past several decades. Many of these workflows have high computational, storage, and/or communication demands, and thus must execute on a wide range of large-scale platforms, from large clouds to upcoming exascale high-performance computing (HPC) platforms. These executions must be managed using some software infrastructure. Due to the popularity of workflows, workflow management systems (WMSs) have been developed to provide abstractions for creating and executing workflows conveniently, efficiently, and portably. While these efforts are all worthwhile, there are now hundreds of independent WMSs, many of which are moribund. As a result, the WMS landscape is segmented and presents significant barriers to entry due to the hundreds of seemingly comparable, yet incompatible, systems that exist. As a result, many teams, small and large, still elect to build their own custom workflow solution rather than adopt, or build upon, existing WMSs. This current state of the WMS landscape negatively impacts workflow users, developers, and researchers. The 'Workflows Community Summit' was held online on January 13, 2021. The overarching goal of the summit was to develop a view of the state of the art and identify crucial research challenges in the workflow community. Prior to the summit, a survey sent to stakeholders in the workflow community (including both developers of WMSs and users of workflows) helped to identify key challenges in this community that were translated into 6 broad themes for the summit, each of them being the object of a focused discussion led by a volunteer member of the community. This report documents and organizes the wealth of information provided by the participants before, during, and after the summit.

Authors

Rafael Ferreira da Silva
Henri Casanova
Kyle Chard
Dan Laney
Dong Ahn
Shantenu Jha
Carole Goble
Lavanya Ramakrishnan
Luc Peterson
Bjoern Enders
Douglas Thain
Ilkay Altintas
Yadu Babuji
Rosa M. Badia
Vivien Bonazzi
Taina Coleman
Michael Crusoe
Ewa Deelman
Frank Di Natale
Paolo Di Tommaso
Thomas Fahringer
Rosa Filgueira
Grigori Fursin
Alex Ganose
Bjorn Gruning
Daniel S. Katz
Olga Kuchar
Ana Kupresanin
Bertram Ludascher
Ketan Maheshwari
Marta Mattoso
Kshitij Mehta
Todd Munson
Jonathan Ozik
Tom Peterka
Loic Pottier
Tim Randles
Stian Soiland-Reyes
Benjamin Tovar
Matteo Turilli
Thomas Uram
Karan Vahi
Michael Wilde
Matthew Wolf
Justin Wozniak

WorkflowHub: Community Framework for Enabling Scientific Workflow Research and Development

2020 IEEE/ACM Workflows in Support of Large-Scale Science (WORKS)
Abstract

Scientific workflows are a cornerstone of modern scientific computing. They are used to describe complex computational applications that require efficient and robust management of large volumes of data, which are typically stored/processed on heterogeneous, distributed resources. The workflow research and development community has employed a number of methods for the quantitative evaluation of existing and novel workflow algorithms and systems. In particular, a common approach is to simulate workflow executions. In previous work, we have presented a collection of tools that have been used for aiding research and development activities in the Pegasus project, and that have been adopted by others for conducting workflow research. Despite their popularity, there are several shortcomings that prevent easy adoption, maintenance, and consistency with the evolving structures and computational requirements of production workflows. In this work, we present WorkflowHub, a community framework that provides a collection of tools for analyzing workflow execution traces, producing realistic synthetic workflow traces, and simulating workflow executions. We demonstrate the realism of the generated synthetic traces by comparing simulated executions of these traces with actual workflow executions. We also contrast these results with those obtained when using the previously available collection of tools. We find that our framework not only can be used to generate representative synthetic workflow traces (i.e., with workflow structures and task characteristics distributions that resemble those in traces obtained from real-world workflow executions), but can also generate representative workflow traces at larger scales than that of available workflow traces.

WfChef: Automated Generation of Accurate Scientific Workflow Generators

17th IEEE eScience 2021
Abstract

Scientific workflow applications have become mainstream and their automated and efficient execution on large-scale compute platforms is the object of extensive research and development. For these efforts to be successful, a solid experimental methodology is needed to evaluate workflow algorithms and systems. A foundation for this methodology is the availability of realistic workflow instances. Dozens of workflow instances for a few scientific applications are available in public repositories. While these are invaluable, they are limited: workflow instances are not available for all application scales of interest. To address this limitation, previous work has developed generators of synthetic, but representative, workflow instances of arbitrary scales. These generators are popular, but implementing them is a manual, labor-intensive process that requires expert application knowledge. As a result, these generators only target a handful of applications, even though hundreds of applications use workflows in production. In this work, we present WfChef, a framework that fully automates the process of constructing a synthetic workflow generator for any scientific application. Based on an input set of workflow instances, WfChef automatically produces a synthetic workflow generator. We define and evaluate several metrics for quantifying the realism of the generated workflows. Using these metrics, we compare the realism of the workflows generated by WfChef generators to that of the workflows generated by the previously available, hand-crafted generators. We find that the WfChef generators not only require zero development effort (because they are automatically produced), but also generate workflows that are more realistic than those generated by hand-crafted generators.

A biometric for shark dorsal fins based on boundary descriptor matching

CAINE: Computer Applications in Industry and Engineering (2019)
Abstract

Recent progress in animal biometrics has revolutionized wildlife research. Cutting edge techniques allow researchers to track individuals through noninvasive methods of recognition that are not only more reliable, but also applicable to large, hard-to-find, and otherwise difficult to observe animals. In this research, we propose a metric for boundary descriptors based on bipartite perfect matching applied in shark dorsal fins. In order to identify a shark, we first take a fin contour and transform it to a normalized coordinate system so that we can analyze images of sharks regardless of orientation and scale. Finally, we propose a metric scheme that performs a minimum weight perfect matching in a bipartite graph. The experimental results show that our metric is applicable to identify and track individuals from visual data.
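
The two steps described in the abstract (normalizing a contour, then scoring two contours by a minimum-weight perfect matching in a bipartite graph) can be sketched as follows. This is an illustration under stated assumptions, not the paper's method: the normalization here removes only translation and scale (not orientation), the edge weight is plain Euclidean distance between points rather than the paper's boundary descriptors, and the matching is found by brute force over permutations, which is only practical for a handful of points. All function names are hypothetical.

```python
from itertools import permutations
from math import hypot


def normalize(points):
    """Center points on their centroid and scale to unit RMS radius.

    Assumption: handles translation and scale only, not rotation.
    """
    n = len(points)
    cx = sum(x for x, _ in points) / n
    cy = sum(y for _, y in points) / n
    centered = [(x - cx, y - cy) for x, y in points]
    scale = (sum(x * x + y * y for x, y in centered) / n) ** 0.5 or 1.0
    return [(x / scale, y / scale) for x, y in centered]


def matching_distance(contour_a, contour_b):
    """Cost of a minimum-weight perfect matching between two point sets.

    Each point of contour_a is paired with exactly one point of contour_b;
    the edge weight is the Euclidean distance after normalization.
    Brute force: tries every permutation, so keep the inputs small.
    """
    if len(contour_a) != len(contour_b):
        raise ValueError("a perfect matching needs equally sized point sets")
    a, b = normalize(contour_a), normalize(contour_b)
    return min(
        sum(hypot(ax - bx, ay - by) for (ax, ay), (bx, by) in zip(a, perm))
        for perm in permutations(b)
    )


# The same shape, shifted, scaled by 3, and listed in a different order.
fin = [(0, 0), (2, 0), (2, 1), (0, 1)]
same_fin = [(16, 8), (10, 5), (10, 8), (16, 5)]
other_fin = [(0, 0), (4, 0), (4, 4), (0, 1)]

print(matching_distance(fin, same_fin) < 1e-9)   # True
print(matching_distance(fin, other_fin) > 0.1)   # True
```

For contours with many sample points, a polynomial-time assignment solver (for example the Hungarian algorithm) would replace the permutation search; the brute-force version is used here only to keep the sketch dependency-free.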

Projects

  • WfCommons

    WfCommons (wfcommons.org) is an open-source framework that aims to support and bridge the theoretical and practical aspects of workflow systems research and development.