TY - GEN
T1 - Provenance-based refresh in data-oriented workflows
AU - Ikeda, Robert
AU - Salihoglu, Semih
AU - Widom, Jennifer
N1 - KAUST Repository Item: Exported on 2020-10-01
Acknowledgements: This work is supported by the National Science Foundation under grants IIS-0414762 and IIS-0904497 and by a KAUST research grant.
This publication acknowledges KAUST support, but has no KAUST affiliated authors.
PY - 2011
Y1 - 2011
N2 - We consider a general workflow setting in which input data sets are processed by a graph of transformations to produce output results. Our goal is to perform efficient selective refresh of elements in the output data, i.e., compute the latest values of specific output elements when the input data may have changed. We explore how data provenance can be used to enable efficient refresh. Our approach is based on capturing one-level data provenance at each transformation when the workflow is run initially. Then at refresh time provenance is used to determine (transitively) which input elements are responsible for given output elements, and the workflow is rerun only on that portion of the data needed for refresh. Our contributions are to formalize the problem setting and the problem itself, to specify properties of transformations and provenance that are required for efficient refresh, and to provide algorithms that apply to a wide class of transformations and workflows. We have built a prototype system supporting the features and algorithms presented in the paper. We report preliminary experimental results on the overhead of provenance capture, and on the crossover point between selective refresh and full workflow recomputation. © 2011 ACM.
UR - http://hdl.handle.net/10754/599413
UR - http://dl.acm.org/citation.cfm?doid=2063576.2063816
UR - http://www.scopus.com/inward/record.url?scp=83055161612&partnerID=8YFLogxK
U2 - 10.1145/2063576.2063816
DO - 10.1145/2063576.2063816
M3 - Conference contribution
SN - 9781450307178
SP - 1659
EP - 1668
BT - Proceedings of the 20th ACM international conference on Information and knowledge management - CIKM '11
PB - Association for Computing Machinery (ACM)
ER -