Observed streamflow data from 966 medium sized catchments (1000-5000 km2) around the globe were used to comprehensively evaluate the daily runoff estimates (1979-2012) of six global hydrological models (GHMs) and four land surface models (LSMs) produced as part of tier-1 of the eartH2Observe project. The models were all driven by the WATCH Forcing Data ERA-Interim (WFDEI) meteorological dataset, but used different datasets for non-meteorologic inputs and were run at various spatial and temporal resolutions, although all data were re-sampled to a common 0. 5° spatial and daily temporal resolution. For the evaluation, we used a broad range of performance metrics related to important aspects of the hydrograph. We found pronounced inter-model performance differences, underscoring the importance of hydrological model uncertainty in addition to climate input uncertainty, for example in studies assessing the hydrological impacts of climate change. The uncalibrated GHMs were found to perform, on average, better than the uncalibrated LSMs in snow-dominated regions, while the ensemble mean was found to perform only slightly worse than the best (calibrated) model. The inclusion of less-accurate models did not appreciably degrade the ensemble performance. Overall, we argue that more effort should be devoted on calibrating and regionalizing the parameters of macro-scale models. We further found that, despite adjustments using gauge observations, the WFDEI precipitation data still contain substantial biases that propagate into the simulated runoff. The early bias in the spring snowmelt peak exhibited by most models is probably primarily due to the widespread precipitation underestimation at high northern latitudes.