readabs.read_abs_cat
Download timeseries data from the Australian Bureau of Statistics (ABS) for a specified ABS catalogue identifier.
1"""Download *timeseries* data from the Australian Bureau 2of Statistics (ABS) for a specified ABS catalogue identifier.""" 3 4# --- imports --- 5# standard library imports 6from functools import cache 7from typing import Any 8import calendar 9 10# analytic imports 11import pandas as pd 12from pandas import DataFrame 13 14# local imports 15from readabs.abs_meta_data import metacol 16from readabs.read_support import HYPHEN 17from readabs.grab_abs_url import grab_abs_url 18 19 20# --- functions --- 21# - public - 22@cache # minimise slowness for any repeat business 23def read_abs_cat( 24 cat: str, 25 keep_non_ts: bool = False, 26 **kwargs: Any, 27) -> tuple[dict[str, DataFrame], DataFrame]: 28 """This function returns the complete ABS Catalogue information as a 29 python dictionary of pandas DataFrames, as well as the associated metadata 30 in a separate DataFrame. The function automates the collection of zip and 31 excel files from the ABS website. If necessary, these files are downloaded, 32 and saved into a cache directory. The files are then parsed to extract time 33 series data, and the associated metadata. 34 35 By default, the cache directory is `./.readabs_cache/`. You can change the 36 default directory name by setting the shell environment variable 37 `READABS_CACHE_DIR` with the name of the preferred directory. 38 39 Parameters 40 ---------- 41 42 cat : str 43 The ABS Catalogue Number for the data to be downloaded and made 44 available by this function. This argument must be specified in the 45 function call. 46 47 keep_non_ts : bool = False 48 A flag for whether to keep the non-time-series tables 49 that might form part of an ABS catalogue item. Normally, the 50 non-time-series information is ignored, and not made available to 51 the user. 52 53 **kwargs : Any 54 The following parameters may be passed as optional keyword arguments. 55 56 history : str = "" 57 Orovide a month-year string to extract historical ABS data. 58 For example, you can set history="dec-2023" to the get the ABS data 59 for a catalogue identifier that was originally published in respect 60 of Q4 of 2023. Note: not all ABS data sources are structured so that 61 this technique works in every case; but most are. 62 63 verbose : bool = False 64 Setting this to true may help diagnose why something 65 might be going wrong with the data retrieval process. 66 67 ignore_errors : bool = False 68 Normally, this function will cease downloading when 69 an error in encountered. However, sometimes the ABS website has 70 malformed links, and changing this setting is necessitated. (Note: 71 if you drop a message to the ABS, they will usually fix broken 72 links with a business day). 73 74 get_zip : bool = True 75 Download the excel files in .zip files. 76 77 get_excel_if_no_zip : bool = True 78 Only try to download .xlsx files if there are no zip 79 files available to be downloaded. Only downloading individual excel 80 files when there are no zip files to download can speed up the 81 download process. 82 83 get_excel : bool = False 84 The default value means that excel files are not 85 automatically download. Note: at least one of `get_zip`, 86 `get_excel_if_no_zip`, or `get_excel` must be true. For most ABS 87 catalogue items, it is sufficient to just download the one zip 88 file. But note, some catalogue items do not have a zip file. 89 Others have quite a number of zip files. 
90 91 single_excel_only : str = "" 92 If this argument is set to a table name (without the 93 .xlsx extension), only that excel file will be downloaded. If 94 set, and only a limited subset of available data is needed, 95 this can speed up download times significantly. Note: overrides 96 `get_zip`, `get_excel_if_no_zip`, `get_excel` and `single_zip_only`. 97 98 single_zip_only : str = "" 99 If this argument is set to a zip file name (without 100 the .zip extension), only that zip file will be downloaded. 101 If set, and only a limited subset of available data is needed, 102 this can speed up download times significantly. Note: overrides 103 `get_zip`, `get_excel_if_no_zip`, and `get_excel`. 104 105 cache_only : bool = False 106 If set to True, this function will only access 107 data that has been previously cached. Normally, the function 108 checks the date of the cache data against the date of the data 109 on the ABS website, before deciding whether the ABS has fresher 110 data that needs to be downloaded to the cache. 111 112 Returns 113 ------------- 114 tuple[dict[str, DataFrame], DataFrame] 115 The function returns a tuple of two items. The first item is a 116 python dictionary of pandas DataFrames (which is the primary data 117 associated with the ABS catalogue item). The second item is a 118 DataFrame of ABS metadata for the ABS collection. 119 120 Example 121 ------- 122 123 ```python 124 import readabs as ra 125 from pandas import DataFrame 126 cat_num = "6202.0" # The ABS labour force survey 127 data: tuple[dict[str, DataFrame], DataFrame] = ra.read_abs_cat(cat=cat_num) 128 abs_dict, meta = data 129 ```""" 130 131 # --- get the time series data --- 132 raw_abs_dict = grab_abs_url(cat=cat, **kwargs) 133 abs_dict, abs_meta = _get_time_series_data( 134 cat, raw_abs_dict, keep_non_ts=keep_non_ts, **kwargs 135 ) 136 137 return abs_dict, abs_meta 138 139 140# - private - 141def _get_time_series_data( 142 cat: str, 143 abs_dict: dict[str, DataFrame], 144 **kwargs: Any, 145) -> tuple[dict[str, DataFrame], DataFrame]: 146 """Using the raw DataFrames from the ABS website, extract the time series 147 data for a specific ABS catalogue identifier. The data is returned in a 148 tuple. The first element is a dictionary of DataFrames, where each 149 DataFrame contains the time series data. The second element is a DataFrame 150 of meta data, which describes each data item in the dictionary""" 151 152 # --- set up --- 153 new_dict: dict[str, DataFrame] = {} 154 meta_data = DataFrame() 155 156 # --- group the sheets and iterate over these groups 157 long_groups = _group_sheets(abs_dict) 158 for table, sheets in long_groups.items(): 159 args = { 160 "cat": cat, 161 "from_dict": abs_dict, 162 "table": table, 163 "long_sheets": sheets, 164 } 165 new_dict, meta_data = _capture(new_dict, meta_data, args, **kwargs) 166 return new_dict, meta_data 167 168 169def _copy_raw_sheets( 170 from_dict: dict[str, DataFrame], 171 long_sheets: list[str], 172 to_dict: dict[str, DataFrame], 173 keep_non_ts, 174) -> dict[str, DataFrame]: 175 """A utility function to copy the raw sheets across to 176 the final dictionary. Used if the data is not in a 177 timeseries format, and keep_non_ts flag is set to True. 
178 Returns an updated final dictionary.""" 179 180 if not keep_non_ts: 181 return to_dict 182 183 for sheet in long_sheets: 184 if sheet in from_dict: 185 to_dict[sheet] = from_dict[sheet] 186 else: 187 # should not happen 188 raise ValueError(f"Glitch: Sheet {sheet} not found in the data.") 189 return to_dict 190 191 192def _capture( 193 to_dict: dict[str, DataFrame], 194 meta_data: DataFrame, 195 args: dict[str, Any], 196 **kwargs: Any, 197) -> tuple[dict[str, DataFrame], DataFrame]: 198 """For a specific Excel file, capture *both* the time series data 199 from the ABS data files as well as the meta data. These data are 200 added to the input 'to_dict" and 'meta_data' respectively, and 201 the combined results are returned as a tuple.""" 202 203 # --- step 0: set up --- 204 keep_non_ts: bool = kwargs.get("keep_non_ts", False) 205 ignore_errors: bool = kwargs.get("ignore_errors", False) 206 207 # --- step 1: capture the meta data --- 208 short_names = [x.split(HYPHEN, 1)[1] for x in args["long_sheets"]] 209 if "Index" not in short_names: 210 print(f"Table {args["table"]} has no 'Index' sheet.") 211 to_dict = _copy_raw_sheets( 212 args["from_dict"], args["long_sheets"], to_dict, keep_non_ts 213 ) 214 return to_dict, meta_data 215 index = short_names.index("Index") 216 217 index_sheet = args["long_sheets"][index] 218 this_meta = _capture_meta(args["cat"], args["from_dict"], index_sheet) 219 if this_meta.empty: 220 to_dict = _copy_raw_sheets( 221 args["from_dict"], args["long_sheets"], to_dict, keep_non_ts 222 ) 223 return to_dict, meta_data 224 225 meta_data = pd.concat([meta_data, this_meta], axis=0) 226 227 # --- step 2: capture the actual time series data --- 228 data = _capture_data(meta_data, args["from_dict"], args["long_sheets"], **kwargs) 229 if len(data): 230 to_dict[args["table"]] = data 231 else: 232 # a glitch: we have the metadata but not the actual data 233 error = f"Unexpected: {args["table"]} has no actual data." 
234 if not ignore_errors: 235 raise ValueError(error) 236 print(error) 237 to_dict = _copy_raw_sheets( 238 args["from_dict"], args["long_sheets"], to_dict, keep_non_ts 239 ) 240 241 return to_dict, meta_data 242 243 244def _capture_data( 245 abs_meta: DataFrame, 246 from_dict: dict[str, DataFrame], 247 long_sheets: list[str], 248 **kwargs: Any, 249) -> DataFrame: 250 """Take a list of ABS data sheets, find the DataFrames for those sheets in the 251 from_dict, and stitch them into a single DataFrame with an appropriate 252 PeriodIndex.""" 253 254 # --- step 0: set up --- 255 verbose: bool = kwargs.get("verbose", False) 256 merged_data = DataFrame() 257 header_row: int = 8 258 259 # --- step 1: capture the time series data --- 260 # identify the data sheets in the list of all sheets from Excel file 261 data_sheets = [x for x in long_sheets if x.split(HYPHEN, 1)[1].startswith("Data")] 262 263 for sheet_name in data_sheets: 264 if verbose: 265 print(f"About to cature data from {sheet_name=}") 266 267 # --- capture just the data, nothing else 268 sheet_data = from_dict[sheet_name].copy() 269 270 # get the columns 271 header = sheet_data.iloc[header_row] 272 sheet_data.columns = pd.Index(header) 273 sheet_data = sheet_data[(header_row + 1) :] 274 275 # get the row indexes 276 sheet_data = _index_to_period(sheet_data, sheet_name, abs_meta, verbose) 277 278 # --- merge data into a single dataframe 279 if len(merged_data) == 0: 280 merged_data = sheet_data 281 else: 282 merged_data = pd.merge( 283 left=merged_data, 284 right=sheet_data, 285 how="outer", 286 left_index=True, 287 right_index=True, 288 suffixes=("", ""), 289 ) 290 291 # --- step 2 - final tidy-ups 292 # remove NA rows 293 merged_data = merged_data.dropna(how="all") 294 # check for NA columns - rarely happens 295 # Note: these empty columns are not removed, 296 # but it is useful to know they are there 297 if merged_data.isna().all().any() and verbose: 298 cols = merged_data.columns[merged_data.isna().all()] 299 print( 300 "Caution: these columns are all NA in " 301 + f"{merged_data[metacol.table].iloc[0]}: {cols}" 302 ) 303 304 # check for duplicate columns - should not happen 305 # Note: these duplicate columns are removed 306 duplicates = merged_data.columns.duplicated() 307 if duplicates.any(): 308 if verbose: 309 dup_table = abs_meta[metacol.table].iloc[0] 310 print( 311 f"Note: duplicates removed from {dup_table}: " 312 + f"{merged_data.columns[duplicates]}" 313 ) 314 merged_data = merged_data.loc[:, ~duplicates].copy() 315 316 # make the data all floats. 
317 merged_data = merged_data.astype(float).sort_index() 318 319 return merged_data 320 321 322def _index_to_period( 323 sheet_data: DataFrame, sheet_name: str, abs_meta: DataFrame, verbose: bool 324) -> DataFrame: 325 """Convert the index of a DataFrame to a PeriodIndex.""" 326 327 index_column = sheet_data[sheet_data.columns[0]].astype(str) 328 sheet_data = sheet_data.drop(sheet_data.columns[0], axis=1) 329 long_row_names = index_column.str.len() > 20 # 19 chars in datetime str 330 if verbose and long_row_names.any(): 331 print(f"You may need to check index column for {sheet_name}") 332 index_column = index_column.loc[~long_row_names] 333 sheet_data = sheet_data.loc[~long_row_names] 334 335 proposed_index = pd.to_datetime(index_column) # 336 337 # get the correct period index 338 short_name = sheet_name.split(HYPHEN, 1)[0] 339 series_id = sheet_data.columns[0] 340 freq = ( 341 abs_meta[abs_meta[metacol.table] == short_name] 342 .at[series_id, metacol.freq] 343 .upper() 344 .strip()[0] 345 ) 346 freq = "Y" if freq == "A" else freq # pandas prefers yearly 347 freq = "Q" if freq == "B" else freq # treat Biannual as quarterly 348 if freq not in ("Y", "Q", "M", "D"): 349 print(f"Check the frequency of the data in sheet: {sheet_name}") 350 351 # create an appropriate period index 352 if freq: 353 if freq in ("Q", "Y"): 354 month = calendar.month_abbr[proposed_index.dt.month.max()].upper() 355 freq = f"{freq}-{month}" 356 sheet_data.index = pd.PeriodIndex(proposed_index, freq=freq) 357 else: 358 raise ValueError(f"With sheet {sheet_name} could not determime PeriodIndex") 359 360 return sheet_data 361 362 363def _capture_meta( 364 cat: str, 365 from_dict: dict[str, DataFrame], 366 index_sheet: str, 367) -> DataFrame: 368 """Capture the metadata from the Index sheet of an ABS excel file. 369 Returns a DataFrame specific to the current excel file. 370 Returning an empty DataFrame, means that the meta data could not 371 be identified. 
Meta data for each ABS data item is organised by row.""" 372 373 # --- step 0: set up --- 374 frame = from_dict[index_sheet] 375 376 # --- step 1: check if the metadata is present in the right place --- 377 # Unfortunately, the header for some of the 3401.0 378 # spreadsheets starts on row 10 379 starting_rows = 8, 9, 10 380 required = metacol.did, metacol.id, metacol.stype, metacol.unit 381 required_set = set(required) 382 all_good = False 383 for header_row in starting_rows: 384 header_columns = frame.iloc[header_row] 385 if required_set.issubset(set(header_columns)): 386 all_good = True 387 break 388 389 if not all_good: 390 print(f"Table has no metadata in sheet {index_sheet}.") 391 return DataFrame() 392 393 # --- step 2: capture the metadata --- 394 file_meta = frame.iloc[header_row + 1 :].copy() 395 file_meta.columns = pd.Index(header_columns) 396 397 # make damn sure there are no rogue white spaces 398 for col in required: 399 file_meta[col] = file_meta[col].str.strip() 400 401 # remove empty columns and rows 402 file_meta = file_meta.dropna(how="all", axis=1).dropna(how="all", axis=0) 403 404 # populate the metadata 405 file_meta[metacol.table] = index_sheet.split(HYPHEN, 1)[0] 406 tab_desc = frame.iat[4, 1].split(".", 1)[-1].strip() 407 file_meta[metacol.tdesc] = tab_desc 408 file_meta[metacol.cat] = cat 409 410 # drop last row - should just be copyright statement 411 file_meta = file_meta.iloc[:-1] 412 413 # set the index to the series_id 414 file_meta.index = pd.Index(file_meta[metacol.id]) 415 416 return file_meta 417 418 419def _group_sheets( 420 abs_dict: dict[str, DataFrame], 421) -> dict[str, list[str]]: 422 """Group the sheets from an Excel file.""" 423 424 keys = list(abs_dict.keys()) 425 long_pairs = [(x.split(HYPHEN, 1)[0], x) for x in keys] 426 427 def group(p_list: list[tuple[str, str]]) -> dict[str, list[str]]: 428 groups: dict[str, list[str]] = {} 429 for x, y in p_list: 430 if x not in groups: 431 groups[x] = [] 432 groups[x].append(y) 433 return groups 434 435 return group(long_pairs) 436 437 438# --- initial testing --- 439if __name__ == "__main__": 440 441 def simple_test(): 442 """A simple test of the read_abs_cat function.""" 443 444 # ABS Catalogue ID 8731.0 has a mix of time 445 # series and non-time series data. Also, 446 # it has unusually structured Excel files. So, a good test. 447 448 print("Starting test.") 449 450 d, _m = read_abs_cat("8731.0", keep_non_ts=False, verbose=False) 451 print(f"--- {len(d)=} ---") 452 print(f"--- {d.keys()=} ---") 453 for table in d.keys(): 454 print(f"{table=} {d[table].shape=} {d[table].index.freqstr=}") 455 456 print("Test complete.") 457 458 simple_test()
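Two implementation details above are easiest to see in isolation. First, `_group_sheets` groups the raw sheet names, which take the form `<table><HYPHEN><sheet name>`, by their table prefix. A standalone sketch of that grouping; the real separator value lives in `readabs.read_support`, so the `"---"` below is only a stand-in:

```python
# Hypothetical stand-in for readabs.read_support.HYPHEN, whose real
# value is internal to the package.
HYPHEN = "---"

keys = ["6202001---Index", "6202001---Data1", "6202002---Index"]
groups: dict[str, list[str]] = {}
for key in keys:
    table = key.split(HYPHEN, 1)[0]  # everything before the separator
    groups.setdefault(table, []).append(key)
print(groups)
# {'6202001': ['6202001---Index', '6202001---Data1'], '6202002': ['6202002---Index']}
```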
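Second, `_index_to_period` anchors quarterly and annual series to the month of the final observation when it builds the pandas frequency string. A minimal standalone sketch of that construction, using made-up dates:

```python
import calendar

import pandas as pd

# Made-up quarterly observation dates ending in November.
dates = pd.Series(
    pd.to_datetime(["2023-02-01", "2023-05-01", "2023-08-01", "2023-11-01"])
)
month = calendar.month_abbr[dates.dt.month.max()].upper()  # "NOV"
index = pd.PeriodIndex(dates, freq=f"Q-{month}")
print(index.freqstr)  # Q-NOV: quarters anchored to a November year-end
```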
This function returns the complete ABS catalogue information as a Python dictionary of pandas DataFrames, as well as the associated metadata in a separate DataFrame. The function automates the collection of zip and Excel files from the ABS website. If necessary, these files are downloaded and saved into a cache directory. The files are then parsed to extract time series data and the associated metadata.
By default, the cache directory is `./.readabs_cache/`. You can change the default directory name by setting the shell environment variable `READABS_CACHE_DIR` with the name of the preferred directory.
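For instance, a minimal sketch of redirecting the cache; `./my_abs_cache` is a hypothetical directory name, and this assumes readabs consults the environment when it builds the cache path (otherwise, export the variable in your shell before starting Python):

```python
import os

# Set READABS_CACHE_DIR before the first download occurs.
os.environ["READABS_CACHE_DIR"] = "./my_abs_cache"

import readabs as ra

abs_dict, meta = ra.read_abs_cat(cat="6202.0")
```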
Parameters
cat : str The ABS Catalogue Number for the data to be downloaded and made available by this function. This argument must be specified in the function call.
keep_non_ts : bool = False A flag for whether to keep the non-time-series tables that might form part of an ABS catalogue item. Normally, the non-time-series information is ignored, and not made available to the user.
**kwargs : Any The following parameters may be passed as optional keyword arguments.
history : str = "" Provide a month-year string to extract historical ABS data. For example, you can set history="dec-2023" to get the ABS data for a catalogue identifier that was originally published in respect of Q4 of 2023. Note: not all ABS data sources are structured so that this technique works in every case, but most are. (A usage sketch of this and the other keyword arguments follows this parameter list.)
verbose : bool = False Setting this to True may help diagnose problems with the data retrieval process.
ignore_errors : bool = False Normally, this function will cease downloading when an error is encountered. However, sometimes the ABS website has malformed links, and setting this flag may be necessary. (Note: if you drop a message to the ABS, they will usually fix broken links within a business day.)
get_zip : bool = True Download the Excel files in .zip files.
get_excel_if_no_zip : bool = True Only try to download individual .xlsx files when there are no zip files available; this can speed up the download process.
get_excel : bool = False The default value means that Excel files are not automatically downloaded. Note: at least one of `get_zip`, `get_excel_if_no_zip`, or `get_excel` must be true. For most ABS catalogue items, it is sufficient to just download the one zip file. But note, some catalogue items do not have a zip file; others have quite a number of zip files.
single_excel_only : str = "" If this argument is set to a table name (without the .xlsx extension), only that Excel file will be downloaded. If set, and only a limited subset of available data is needed, this can speed up download times significantly. Note: overrides `get_zip`, `get_excel_if_no_zip`, `get_excel` and `single_zip_only`.
single_zip_only : str = "" If this argument is set to a zip file name (without the .zip extension), only that zip file will be downloaded. If set, and only a limited subset of available data is needed, this can speed up download times significantly. Note: overrides `get_zip`, `get_excel_if_no_zip`, and `get_excel`.
cache_only : bool = False If set to True, this function will only access data that has been previously cached. Normally, the function checks the date of the cached data against the date of the data on the ABS website before deciding whether the ABS has fresher data that needs to be downloaded to the cache.
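As flagged above, a brief usage sketch combining several of these keyword arguments. It assumes catalogue 6202.0 publishes a table named "6202001" (a hypothetical name for illustration; check the catalogue's actual table names):

```python
import readabs as ra

# Download just one table from the labour force survey, reusing the
# cache where possible.
abs_dict, meta = ra.read_abs_cat(
    cat="6202.0",
    keep_non_ts=False,            # ignore any non-time-series tables (default)
    single_excel_only="6202001",  # fetch only this .xlsx table
    verbose=True,                 # print diagnostic progress messages
)

# A historical vintage of the same catalogue, as originally published
# for Q4 of 2023.
old_dict, old_meta = ra.read_abs_cat(cat="6202.0", history="dec-2023")
```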
Returns
tuple[dict[str, DataFrame], DataFrame] The function returns a tuple of two items. The first item is a Python dictionary of pandas DataFrames (the primary data associated with the ABS catalogue item). The second item is a DataFrame of ABS metadata for the collection.
Example
```python
import readabs as ra
from pandas import DataFrame

cat_num = "6202.0"  # The ABS labour force survey
data: tuple[dict[str, DataFrame], DataFrame] = ra.read_abs_cat(cat=cat_num)
abs_dict, meta = data
```
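Building on the example above, the returned metadata can be used to locate individual series. A sketch, assuming the `metacol` column names from `readabs.abs_meta_data` (as used in the module source); the search string is hypothetical, not a guaranteed ABS label:

```python
from readabs.abs_meta_data import metacol  # column names used in `meta`

# Find series whose description mentions employment.
hits = meta[meta[metacol.did].str.contains("Employed total", na=False)]
print(hits[[metacol.id, metacol.table, metacol.unit, metacol.freq]])

# Pull one matching column out of its table.
if not hits.empty:
    series_id = hits[metacol.id].iloc[0]
    table_name = hits[metacol.table].iloc[0]
    series = abs_dict[table_name][series_id]
    print(series.tail())
```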