Data is the new currency: Building and managing financial data pipelines for artificial intelligence readiness

Authors

Jeevani Singireddy
Intuit Inc, California, United States

Synopsis

In the past couple of years, various technologies have arisen to automate the building and maintenance of machine learning (ML) pipelines. In particular, the introduction of orchestration platforms and recent research into ML pipeline failures have produced open source tools and guidelines that help users improve the reliability and observability of their ML deployments. These technologies focus on one subcomponent of the end-to-end machine learning workflow: the pipeline that produces the ML model. That pipeline, however, relies on a similarly complex and critical data pipeline that produces the data it consumes. The data pipeline typically consists of ETL (extraction, transformation, loading) jobs, data validation, feature engineering, and further transformations, all of which must run at set time intervals or when data thresholds are reached in order to keep up with the incoming data for production applications.
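
As a rough illustration of the stages named above, the following minimal Python sketch chains an extraction step, a validation step, and a feature engineering step, and decides when to run based on either elapsed time or the volume of pending data. It is not taken from the chapter: the record layout ("account,amount"), the `Trigger` class, and every function name are hypothetical choices made only for this example.

```python
from dataclasses import dataclass


@dataclass
class Trigger:
    """Hypothetical run condition: a time interval OR a data threshold."""
    max_interval_s: float  # run if at least this many seconds have passed
    min_new_rows: int      # or if at least this many new rows are waiting

    def should_run(self, seconds_since_last_run: float, pending_rows: int) -> bool:
        return (seconds_since_last_run >= self.max_interval_s
                or pending_rows >= self.min_new_rows)


def extract(raw_lines):
    """Extraction: split assumed 'account,amount' lines into dictionaries."""
    return [dict(zip(("account", "amount"), line.split(","))) for line in raw_lines]


def validate(records):
    """Validation: keep only records with an account and a numeric amount."""
    valid = []
    for record in records:
        try:
            amount = float(record["amount"])
        except (KeyError, ValueError):
            continue  # drop malformed rows instead of failing the whole run
        if record.get("account"):
            valid.append({"account": record["account"], "amount": amount})
    return valid


def build_features(records):
    """Feature engineering: total amount per account."""
    totals = {}
    for record in records:
        totals[record["account"]] = totals.get(record["account"], 0.0) + record["amount"]
    return totals


def run_pipeline(raw_lines):
    """Run the three stages in order on one batch of raw lines."""
    return build_features(validate(extract(raw_lines)))
```

A scheduler would call `Trigger.should_run` periodically and invoke `run_pipeline` on the pending batch whenever it returns true. A production pipeline would also record which rows were rejected and why, rather than dropping them silently.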

Managing a data pipeline, however, is significantly more complex than managing an ML pipeline. First, a data pipeline includes far more jobs, and far more heterogeneous jobs, than an ML pipeline. A typical production data pipeline may include hundreds of jobs, whereas an ML pipeline often consists of three or four. Data jobs can also be highly heterogeneous: the job type, the job result, and the input data type may all vary, and the inputs and outputs of a job may be tables, files, or both. Thus, managing a data pipeline requires specialized management techniques that are much more general than those used for an ML pipeline.
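
One way to picture this heterogeneity is a job descriptor that records the job type together with inputs and outputs that may be tables or files. The sketch below is an illustration under stated assumptions, not the chapter's design: `ArtifactKind`, `Artifact`, `Job`, `execution_order`, and the three example jobs are all hypothetical names. It derives a valid run order from the declared inputs and outputs using the standard-library `graphlib` module (Python 3.9 or later).

```python
from dataclasses import dataclass
from enum import Enum
from graphlib import TopologicalSorter


class ArtifactKind(Enum):
    TABLE = "table"
    FILE = "file"


@dataclass(frozen=True)
class Artifact:
    name: str
    kind: ArtifactKind


@dataclass(frozen=True)
class Job:
    name: str
    job_type: str        # free-form label, e.g. "etl" or "feature"
    inputs: tuple = ()   # artifacts the job reads
    outputs: tuple = ()  # artifacts the job writes


def execution_order(jobs):
    """Order jobs so that each runs after the jobs producing its inputs."""
    producer = {artifact: job.name for job in jobs for artifact in job.outputs}
    graph = {
        job.name: {producer[a] for a in job.inputs if a in producer}
        for job in jobs
    }
    return list(TopologicalSorter(graph).static_order())


# Hypothetical three-job chain mixing file and table artifacts.
raw = Artifact("raw_transactions.csv", ArtifactKind.FILE)
ledger = Artifact("ledger", ArtifactKind.TABLE)
features = Artifact("account_features", ArtifactKind.TABLE)

jobs = [
    Job("features", "feature", inputs=(ledger,), outputs=(features,)),
    Job("load", "etl", inputs=(raw,), outputs=(ledger,)),
    Job("ingest", "etl", outputs=(raw,)),
]
order = execution_order(jobs)
```

Here `order` lists `ingest` before `load` and `load` before `features`, regardless of the order in which the jobs were declared. Because the dependency graph is inferred from declared artifacts rather than written by hand, the same approach extends to pipelines with hundreds of jobs.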

Second, a data pipeline consists of critical computations that are often hidden from the entities that use the pipeline's results. For instance, the lifespan of an ML model depends on data and preparation methods that are many times larger than the model itself, and the model's legitimacy and appropriateness rely heavily on being able to verify that each component is correct and performs well (not merely that it is feasible). Thus, breaches of public trust in data pipelines lead to larger, and sometimes insurmountable, problems that an organization must face if it is to continue using the flawed products.

Forthcoming

26 April 2025

How to Cite

Singireddy, J. (2025). Data is the new currency: Building and managing financial data pipelines for artificial intelligence readiness. In Smart Finance: Harnessing Artificial Intelligence to Transform Tax, Accounting, Payroll, and Credit Management for the Digital Age (pp. 94-106). Deep Science Publishing. https://doi.org/10.70593/978-93-49910-40-9_7