The data provider is usually the scientist performing measurements in the field.
Data flow model :download:`specification <../graphics/DataFlowModel.svg>`.
The data provider must make sure that i) the raw data arrive at the right place in the landing-zone, ii) a conversion script exists to standardize the raw data, and iii) metadata are provided.
Working on the real landing-zone would be dangerous. Therefore, all development and testing is done on a copy of the landing-zone. The datapool provides a command to create development landing-zones. A development landing-zone can have any name, but let's call it dlz for now:
This creates a folder (a copy of the real landing-zone) called dlz in the home directory. You can inspect the created landing-zone with ``ls -l dlz``.
The datapool provides various checks to ensure that the provided conversion scripts and metadata are consistent. The checks are run by:
If everything is fine, modify the development landing-zone (e.g. add a new sensor) according to the instructions given below. After the modifications, run the checks again.
If this succeeds, update the operational landing-zone:
All future raw data should be delivered directly into the operational database.
In the following, the different types of modifications and additions are explained.
In order to add a new measuring site, the information about this site has to be provided as a site.yaml file in a new folder for the site, within the sites folder of the landing-zone. The information to be specified is:
The structure of the file has to be the same as in the example below:
.. code-block:: yaml

    name: industry
    description: >
        Site at pump house at industry. Installed nearly on top of the pump.
        Detection of pump possible! When a pump must be removed the sensor
        has to be taken out and installed after again.
    street: Undermülistrasse
    city: 8320 Fehraltorf, Switzerland
    coordinates:
        x: 47.394973
        y: 8.733811
        z:
    # pictures are optional:
    pictures:
        - path: images/installation.png
          description: Installation of Lora Ultrasonic Sensor 00769
          # date is optional:
          date: 2016/08/22 12:00:00
        - path: images/impact.jpg
          description: Impact Zone of Sensor 00769
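Before running the datapool checks, such a file can be given a quick local sanity check. The sketch below is illustrative only: the helper ``check_site`` and the exact set of required keys are assumptions based on the example above, not part of the datapool. It operates on the dictionary obtained by parsing the YAML file (e.g. with ``yaml.safe_load``).

```python
# Hypothetical sanity check for a parsed site.yaml -- not part of the
# datapool. 'site' is the dictionary obtained from yaml.safe_load().
REQUIRED_KEYS = {"name", "description", "street", "city", "coordinates"}

def check_site(site):
    """Return a list of problems found ([] means the site looks fine)."""
    problems = ["missing key: %s" % k
                for k in sorted(REQUIRED_KEYS - site.keys())]
    coords = site.get("coordinates") or {}
    for axis in ("x", "y", "z"):
        if axis not in coords:
            problems.append("missing coordinate: %s" % axis)
    # pictures are optional, but each entry needs a path and a description
    for pic in site.get("pictures") or []:
        for key in ("path", "description"):
            if key not in pic:
                problems.append("picture missing key: %s" % key)
    return problems
```

Reporting all problems at once, rather than stopping at the first one, saves round-trips when preparing a new site folder.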
The file parameters.yaml is stored in the data folder and contains all the parameters. New parameters can be added here. The information to be included is:
The syntax has to match the following example:
.. code-block:: yaml

    - name: Absorbance 202.50_nm
      unit: m-1
      description: absorbance at 202.50 nm wavelength
The files arriving in the landing-zone are called raw data. Every raw data file must be converted into a so-called standardized file by a conversion script. The format of the standardized files is defined below. Typically, every source instance needs an individually adapted conversion script.
The standardized file format for the input data is a CSV file with either six or four columns. It must adhere to the following standard:
With six columns, the position of the measurement is given by coordinates:

===================  =====================  ======  ======  ======  ===
timestamp            parameter              value   x       y       z
===================  =====================  ======  ======  ======  ===
2013-11-13 10:06:00  Water Level            148.02  682558  239404
2013-11-13 10:08:00  Water Level            146.28  682558  239404
2013-11-13 10:08:00  Average Flow Velocity  0.64    682558  239404  412
...                  ...                    ...     ...     ...     ...
===================  =====================  ======  ======  ======  ===
With four columns, the position is given by the name of the measuring site:

===================  =====================  ======  ======
timestamp            parameter              value   site
===================  =====================  ======  ======
2013-11-13 10:06:00  Water Level            148.02  zurich
2013-11-13 10:08:00  Water Level            146.28  zurich
2013-11-13 10:08:00  Average Flow Velocity  0.64    zurich
...                  ...                    ...     ...
===================  =====================  ======  ======
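A conversion script's output can be verified against this format before delivery. The sketch below is not part of the datapool: the function name ``check_standardized`` and the strictness of the checks are assumptions. It accepts both the six-column (coordinate) and four-column (site) variants, ``;``-separated as written by the example conversion scripts, with timestamps formatted as in the tables above.

```python
# Hypothetical format checker for a standardized file -- not part of
# the datapool. Expects ';'-separated columns and timestamps of the
# form 'YYYY-MM-DD HH:MM:SS', as shown in the tables above.
import csv
from datetime import datetime

HEADERS = (
    ["timestamp", "parameter", "value", "x", "y", "z"],
    ["timestamp", "parameter", "value", "site"],
)

def check_standardized(path):
    """Raise ValueError on the first formatting problem found."""
    with open(path, newline="") as fh:
        rows = list(csv.reader(fh, delimiter=";"))
    header = [c.strip() for c in rows[0]]
    if header not in HEADERS:
        raise ValueError("unexpected header: %r" % header)
    for lineno, row in enumerate(rows[1:], start=2):
        if len(row) != len(header):
            raise ValueError("line %d: wrong number of columns" % lineno)
        try:
            datetime.strptime(row[0].strip(), "%Y-%m-%d %H:%M:%S")
            float(row[2])  # the value column must be numeric
        except ValueError:
            raise ValueError("line %d: bad timestamp or value" % lineno)
```

Running such a check locally catches most formatting mistakes before the file ever reaches the landing-zone.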
The conversion script must define a function which reads raw data and writes an output file (a standardized file). The first argument of this function is the path to the input raw data, the second argument the path to the resulting file.
The following points should be considered when writing a conversion script:
The following code snippets show what a conversion script might look like in different languages.
.. code-block:: r

    # Example R conversion script
    # September 27, 2016 -- Alex Hunziker

    library(reshape2)

    convert <- function(raw_file, output_file){

        data.raw <- utils::read.table(raw_file, sep="\t", skip=1, header=F)
        names(data.raw) <- c("Date Time", "Water Level", "Average Flow Velocity",
                             "Flow", "Temperature", "Surface Flow Velocity",
                             "Distance", "Distance Reading Count",
                             "Surcharge Level", "Peak to Mean Ratio",
                             "Number of Samples", "Battery Voltage")

        if(ncol(data.raw) != 12)
            stop(paste("Error: Input File has", ncol(data.raw),
                       "columns, instead of the expected 12 columns."))

        if(!all(sapply(data.raw[2:ncol(data.raw)], is.numeric)==TRUE))
            stop("Error: Non-numeric input where numeric values were expected.")

        # define coordinates
        xcoor <- 682558
        ycoor <- 239404
        zcoor <- ""

        ## reformat data
        time <- strptime(data.raw$"Date Time", "%d.%m.%Y %H:%M")
        data.raw$"Date Time" <- format(time, "%Y-%m-%d %H:%M:%S")

        data.form <- reshape2::melt(data.raw, id.vars = c("Date Time"))
        colnames(data.form) <- c("timestamp", "parameter", "value")
        data.form$X <- xcoor
        data.form$Y <- ycoor
        data.form$Z <- zcoor

        # remove NA values
        data.form <- stats::na.omit(data.form)

        utils::write.table(data.form, file=output_file, row.names=FALSE,
                           col.names=TRUE, quote=FALSE, sep=";")
    }
.. code-block:: julia

    # Example Julia conversion script
    # September 27, 2016 -- Alex Hunziker

    module conversion_FloDar_Fehraltorf_2

    # ---> 1.) load required package (optional)
    using DataFrames

    function convert(raw_file, output_file)

        # ---> 2.) read file
        if(!isfile(raw_file))
            error("Error: raw_file does not exist.")
        end

        # the header line contains non-utf8 encoded characters, so we skip this:
        dataraw = DataFrame(readtable(raw_file, separator = '\t', skipstart=1,
                                      header=false))
        names!(dataraw, map(symbol, ["Date Time", "Water Level",
                                     "Average Flow Velocity", "Flow",
                                     "Temperature", "Surface Flow Velocity",
                                     "Distance", "Distance Reading Count",
                                     "Surcharge Level", "Peak to Mean Ratio",
                                     "Number of Samples", "Battery Voltage"]))

        ## ---> 3.) test properties
        if(size(dataraw, 2) != 12)
            error("Input File has wrong number of columns.")
        end

        ## ---> 4.) add additional information (optional)
        # define coordinates
        xcoor = 682558
        ycoor = 239404
        zcoor = ""

        ## ---> 5.) reformat data
        selCol = symbol("Date Time")
        time = Dates.DateTime(dataraw[selCol], "dd.mm.yyyy HH:MM")
        dataraw[selCol] = Dates.format(time, "yyyy-mm-dd HH:MM")

        dataForm = stack(dataraw, [2:12], selCol)
        dataForm = dataForm[:, [selCol, :variable, :value]]
        dataForm[4] = xcoor
        dataForm[5] = ycoor
        dataForm[6] = zcoor
        names!(dataForm, [:timestamp, :parameter, :value, :x, :y, :z])

        deleterows!(dataForm, find(isna(dataForm[:, symbol("value")])))

        ## ---> 6.) write file
        writetable(output_file, dataForm, separator = ';')

    end

    end
.. code-block:: python

    # Example Python conversion script
    # September 27, 2016 -- Alex Hunziker

    # ---> 1.) load required packages (optional)
    import os.path
    import pandas

    def convert(raw_file, output_file):

        # ---> 2.) read file
        if not os.path.isfile(raw_file):
            raise ValueError('Error: Input File does not exist.')

        raw_data = pandas.read_csv(raw_file, sep='\t', encoding="latin-1")
        colNames = ("Date Time", "Water Level", "Average Flow Velocity",
                    "Flow", "Temperature", "Surface Flow Velocity",
                    "Distance", "Distance Reading Count", "Surcharge Level",
                    "Peak to Mean Ratio", "Number of Samples",
                    "Battery Voltage")
        raw_data.columns = colNames

        # ---> 3.) test properties
        if len(raw_data.columns) != 12:
            raise ValueError('Error: Input File has wrong number of columns.')

        # ---> 4.) add additional information (optional)
        # define coordinates
        xcoor = 682558
        ycoor = 239404
        zcoor = ""

        # ---> 5.) reformat data
        time = pandas.to_datetime(raw_data['Date Time'],
                                  format="%d.%m.%Y %H:%M")
        raw_data['Date Time'] = time.apply(
            lambda x: x.strftime('%Y-%m-%d %H:%M'))

        data = pandas.melt(raw_data, id_vars=['Date Time'],
                           value_vars=list(raw_data.columns[1:12]))
        data.columns = ['Date Time', 'parameter', 'value']
        data = data.dropna()
        data['x'] = xcoor
        data['y'] = ycoor
        data['z'] = zcoor

        # ---> 6.) write file
        data.to_csv(output_file, sep=";", index=False)
.. code-block:: matlab

    % % SWW-DWH: Example MatLab conversion script
    % % 19/12/2016 - Frank Blumensaat
    % Example: conversion('raw_data\data-001.raw','out.dat');
    % -------------------------------------------------------
    function conversion(fNameIn, fNameOut)

    % read full content of the file into 'data'
    fid = fopen(fullfile(fNameIn), 'r');
    dataRaw = textscan(fid, '%s %f %f %f %f %f %f %f %f %f %f %f', Inf, ...
        'Delimiter', '\t', 'TreatAsEmpty', {'NA'}, 'HeaderLines', 1);
    fclose(fid);

    % possible to include check if 12 columns and numeric val's in col2 - col12
    fid = fopen(fullfile(fNameIn), 'r');
    names = textscan(fid, '%s %s %s %s %s %s %s %s %s %s %s %s', 1, ...
        'Delimiter', '\t', 'HeaderLines', 0);
    fclose(fid);

    % parse string of TRANSFER time (time stamp) into ML number
    datTime = datenum(dataRaw{1,1}(:), 'dd.mm.yyyy HH:MM');

    % define coordinates
    xcoor = ones(length(dataRaw{1}),1).*682558;
    ycoor = ones(length(dataRaw{1}),1).*239404;
    zcoor = zeros(length(dataRaw{1}),1);

    % split data matrix acc. to parameter and remove NaNs
    for j = 2:size(dataRaw,2)
        dataSplit(j-1).var = excise([datTime dataRaw{1,j} xcoor ycoor zcoor]);
    end

    % some parameter names are not conforming to parameters.yaml:
    parametersRaw = {'Level', 'Velocity', 'Surface Velocity', 'PMR', ...
        'NOS', 'Power Supply'};
    parametersUniform = {'Water Level', 'Average Flow Velocity', ...
        'Surface Flow Velocity', 'Peak to Mean Ratio', ...
        'Number of Samples', 'Battery Voltage'};
    fixNames = containers.Map(parametersRaw, parametersUniform);

    % write processed data to a cell array
    celldata = {};
    clear celldataTemp
    for k = 1:length(dataSplit)
        for i = 1:length(dataSplit(k).var)
            % following the ISO 8601 date standard
            celldataTemp{i,1} = datestr(dataSplit(k).var(i,1), ...
                'yyyy-mm-dd HH:MM:SS');
            name = char(names{k+1});
            % our parameters.yaml does not have the units in (..), so we
            % remove them:
            name = regexprep(name, '\(.*\)', '');
            % correct some names:
            if isKey(fixNames, name)
                name = fixNames(name);
            end
            celldataTemp{i,2} = name;
            celldataTemp{i,3} = dataSplit(k).var(i,2);
            celldataTemp{i,4} = dataSplit(k).var(i,3);
            celldataTemp{i,5} = dataSplit(k).var(i,4);
            celldataTemp{i,6} = '';
        end
        celldata = vertcat(celldata, celldataTemp);
        clear celldataTemp
    end

    %% write selected data to TXT file
    fid = fopen(fullfile(fNameOut), 'w');
    fprintf(fid, '%s; %s; %s; %s; %s; %s \n', 'timestamp', 'parameter', ...
        'value', 'x', 'y', 'z');
    nrows = size(celldata, 1);
    for row = 1:nrows
        fprintf(fid, '%s; %s; %f; %d; %d; %d \n', celldata{row,:});
    end
    fclose(fid);
    end

    %% function to remove NaN values
    function X = excise(X)
    X(any(isnan(X)'),:) = [];
    end