readabs.read_abs_cat

Download timeseries data from the Australian Bureau of Statistics (ABS) for a specified ABS catalogue identifier.

  1"""Download *timeseries* data from the Australian Bureau 
  2of Statistics (ABS) for a specified ABS catalogue identifier."""
  3
  4# --- imports ---
  5# standard library imports
  6from functools import cache
  7from typing import Any
  8import calendar
  9
 10# analytic imports
 11import pandas as pd
 12from pandas import DataFrame
 13
 14# local imports
 15from readabs.abs_meta_data import metacol
 16from readabs.read_support import HYPHEN
 17from readabs.grab_abs_url import grab_abs_url
 18
 19
 20# --- functions ---
 21# - public -
 22@cache  # minimise slowness for any repeat business
 23def read_abs_cat(
 24    cat: str,
 25    keep_non_ts: bool = False,
 26    **kwargs: Any,
 27) -> tuple[dict[str, DataFrame], DataFrame]:
 28    """This function returns the complete ABS Catalogue information as a
 29    python dictionary of pandas DataFrames, as well as the associated metadata
 30    in a separate DataFrame. The function automates the collection of zip and
 31    excel files from the ABS website. If necessary, these files are downloaded,
 32    and saved into a cache directory. The files are then parsed to extract time
 33    series data, and the associated metadata.
 34
 35    By default, the cache directory is `./.readabs_cache/`. You can change the
 36    default directory name by setting the shell environment variable
 37    `READABS_CACHE_DIR` with the name of the preferred directory.
 38
 39    Parameters
 40    ----------
 41
 42    cat : str
 43        The ABS Catalogue Number for the data to be downloaded and made
 44        available by this function. This argument must be specified in the
 45        function call.
 46
 47    keep_non_ts : bool = False
 48        A flag for whether to keep the non-time-series tables
 49        that might form part of an ABS catalogue item. Normally, the
 50        non-time-series information is ignored, and not made available to
 51        the user.
 52
 53    **kwargs : Any
 54        The following parameters may be passed as optional keyword arguments.
 55
 56    history : str = ""
 57        Orovide a month-year string to extract historical ABS data.
 58        For example, you can set history="dec-2023" to the get the ABS data
 59        for a catalogue identifier that was originally published in respect
 60        of Q4 of 2023. Note: not all ABS data sources are structured so that
 61        this technique works in every case; but most are.
 62
 63    verbose : bool = False
 64        Setting this to true may help diagnose why something
 65        might be going wrong with the data retrieval process.
 66
 67    ignore_errors : bool = False
 68        Normally, this function will cease downloading when
 69        an error in encountered. However, sometimes the ABS website has
 70        malformed links, and changing this setting is necessitated. (Note:
 71        if you drop a message to the ABS, they will usually fix broken
 72        links with a business day).
 73
 74    get_zip : bool = True
 75        Download the excel files in .zip files.
 76
 77    get_excel_if_no_zip : bool = True
 78        Only try to download .xlsx files if there are no zip
 79        files available to be downloaded. Only downloading individual excel
 80        files when there are no zip files to download can speed up the
 81        download process.
 82
 83    get_excel : bool = False
 84        The default value means that excel files are not
 85        automatically download. Note: at least one of `get_zip`,
 86        `get_excel_if_no_zip`, or `get_excel` must be true. For most ABS
 87        catalogue items, it is sufficient to just download the one zip
 88        file. But note, some catalogue items do not have a zip file.
 89        Others have quite a number of zip files.
 90
 91    single_excel_only : str = ""
 92        If this argument is set to a table name (without the
 93        .xlsx extension), only that excel file will be downloaded. If
 94        set, and only a limited subset of available data is needed,
 95        this can speed up download times significantly. Note: overrides
 96        `get_zip`, `get_excel_if_no_zip`, `get_excel` and `single_zip_only`.
 97
 98    single_zip_only : str = ""
 99        If this argument is set to a zip file name (without
100        the .zip extension), only that zip file will be downloaded.
101        If set, and only a limited subset of available data is needed,
102        this can speed up download times significantly. Note: overrides
103        `get_zip`, `get_excel_if_no_zip`, and `get_excel`.
104
105    cache_only : bool = False
106        If set to True, this function will only access
107        data that has been previously cached. Normally, the function
108        checks the date of the cache data against the date of the data
109        on the ABS website, before deciding whether the ABS has fresher
110        data that needs to be downloaded to the cache.
111
112    Returns
113    -------------
114    tuple[dict[str, DataFrame], DataFrame]
115        The function returns a tuple of two items. The first item is a
116        python dictionary of pandas DataFrames (which is the primary data
117        associated with the ABS catalogue item). The second item is a
118        DataFrame of ABS metadata for the ABS collection.
119
120    Example
121    -------
122
123    ```python
124    import readabs as ra
125    from pandas import DataFrame
126    cat_num = "6202.0"  # The ABS labour force survey
127    data: tuple[dict[str, DataFrame], DataFrame] = ra.read_abs_cat(cat=cat_num)
128    abs_dict, meta = data
129    ```"""
130
131    # --- get the time series data ---
132    raw_abs_dict = grab_abs_url(cat=cat, **kwargs)
133    abs_dict, abs_meta = _get_time_series_data(
134        cat, raw_abs_dict, keep_non_ts=keep_non_ts, **kwargs
135    )
136
137    return abs_dict, abs_meta
138
139
# - private -
def _get_time_series_data(
    cat: str,
    abs_dict: dict[str, DataFrame],
    **kwargs: Any,
) -> tuple[dict[str, DataFrame], DataFrame]:
    """Using the raw DataFrames from the ABS website, extract the time series
    data for a specific ABS catalogue identifier. The data is returned in a
    tuple. The first element is a dictionary of DataFrames, where each
    DataFrame contains the time series data. The second element is a DataFrame
    of metadata, which describes each data item in the dictionary."""

    # --- set up ---
    new_dict: dict[str, DataFrame] = {}
    meta_data = DataFrame()

    # --- group the sheets and iterate over these groups ---
    long_groups = _group_sheets(abs_dict)
    for table, sheets in long_groups.items():
        args = {
            "cat": cat,
            "from_dict": abs_dict,
            "table": table,
            "long_sheets": sheets,
        }
        new_dict, meta_data = _capture(new_dict, meta_data, args, **kwargs)
    return new_dict, meta_data


def _copy_raw_sheets(
    from_dict: dict[str, DataFrame],
    long_sheets: list[str],
    to_dict: dict[str, DataFrame],
    keep_non_ts: bool,
) -> dict[str, DataFrame]:
    """A utility function to copy the raw sheets across to
    the final dictionary. Used if the data is not in a
    timeseries format, and the keep_non_ts flag is set to True.
    Returns an updated final dictionary."""

    if not keep_non_ts:
        return to_dict

    for sheet in long_sheets:
        if sheet in from_dict:
            to_dict[sheet] = from_dict[sheet]
        else:
            # should not happen
            raise ValueError(f"Glitch: Sheet {sheet} not found in the data.")
    return to_dict


def _capture(
    to_dict: dict[str, DataFrame],
    meta_data: DataFrame,
    args: dict[str, Any],
    **kwargs: Any,
) -> tuple[dict[str, DataFrame], DataFrame]:
    """For a specific Excel file, capture *both* the time series data
    from the ABS data files as well as the metadata. These data are
    added to the input 'to_dict' and 'meta_data' respectively, and
    the combined results are returned as a tuple."""

    # --- step 0: set up ---
    keep_non_ts: bool = kwargs.get("keep_non_ts", False)
    ignore_errors: bool = kwargs.get("ignore_errors", False)

    # --- step 1: capture the meta data ---
    short_names = [x.split(HYPHEN, 1)[1] for x in args["long_sheets"]]
    if "Index" not in short_names:
        print(f"Table {args['table']} has no 'Index' sheet.")
        to_dict = _copy_raw_sheets(
            args["from_dict"], args["long_sheets"], to_dict, keep_non_ts
        )
        return to_dict, meta_data
    index = short_names.index("Index")

    index_sheet = args["long_sheets"][index]
    this_meta = _capture_meta(args["cat"], args["from_dict"], index_sheet)
    if this_meta.empty:
        to_dict = _copy_raw_sheets(
            args["from_dict"], args["long_sheets"], to_dict, keep_non_ts
        )
        return to_dict, meta_data

    meta_data = pd.concat([meta_data, this_meta], axis=0)

    # --- step 2: capture the actual time series data ---
    data = _capture_data(meta_data, args["from_dict"], args["long_sheets"], **kwargs)
    if len(data):
        to_dict[args["table"]] = data
    else:
        # a glitch: we have the metadata but not the actual data
        error = f"Unexpected: {args['table']} has no actual data."
        if not ignore_errors:
            raise ValueError(error)
        print(error)
        to_dict = _copy_raw_sheets(
            args["from_dict"], args["long_sheets"], to_dict, keep_non_ts
        )

    return to_dict, meta_data


def _capture_data(
    abs_meta: DataFrame,
    from_dict: dict[str, DataFrame],
    long_sheets: list[str],
    **kwargs: Any,
) -> DataFrame:
    """Take a list of ABS data sheets, find the DataFrames for those sheets in the
    from_dict, and stitch them into a single DataFrame with an appropriate
    PeriodIndex."""

    # --- step 0: set up ---
    verbose: bool = kwargs.get("verbose", False)
    merged_data = DataFrame()
    header_row: int = 8

    # --- step 1: capture the time series data ---
    # identify the data sheets in the list of all sheets from the Excel file
    data_sheets = [x for x in long_sheets if x.split(HYPHEN, 1)[1].startswith("Data")]

    for sheet_name in data_sheets:
        if verbose:
            print(f"About to capture data from {sheet_name=}")

        # --- capture just the data, nothing else
        sheet_data = from_dict[sheet_name].copy()

        # get the columns
        header = sheet_data.iloc[header_row]
        sheet_data.columns = pd.Index(header)
        sheet_data = sheet_data[(header_row + 1) :]

        # get the row indexes
        sheet_data = _index_to_period(sheet_data, sheet_name, abs_meta, verbose)

        # --- merge data into a single dataframe
        if len(merged_data) == 0:
            merged_data = sheet_data
        else:
            merged_data = pd.merge(
                left=merged_data,
                right=sheet_data,
                how="outer",
                left_index=True,
                right_index=True,
                suffixes=("", ""),
            )

    # --- step 2: final tidy-ups ---
    # remove NA rows
    merged_data = merged_data.dropna(how="all")
    # check for NA columns - rarely happens
    # Note: these empty columns are not removed,
    # but it is useful to know they are there
    if merged_data.isna().all().any() and verbose:
        cols = merged_data.columns[merged_data.isna().all()]
        print(
            "Caution: these columns are all NA in "
            + f"{abs_meta[metacol.table].iloc[0]}: {cols}"
        )

    # check for duplicate columns - should not happen
    # Note: these duplicate columns are removed
    duplicates = merged_data.columns.duplicated()
    if duplicates.any():
        if verbose:
            dup_table = abs_meta[metacol.table].iloc[0]
            print(
                f"Note: duplicates removed from {dup_table}: "
                + f"{merged_data.columns[duplicates]}"
            )
        merged_data = merged_data.loc[:, ~duplicates].copy()

    # make the data all floats
    merged_data = merged_data.astype(float).sort_index()

    return merged_data


def _index_to_period(
    sheet_data: DataFrame, sheet_name: str, abs_meta: DataFrame, verbose: bool
) -> DataFrame:
    """Convert the index of a DataFrame to a PeriodIndex."""

    index_column = sheet_data[sheet_data.columns[0]].astype(str)
    sheet_data = sheet_data.drop(sheet_data.columns[0], axis=1)
    long_row_names = index_column.str.len() > 20  # 19 chars in a datetime str
    if verbose and long_row_names.any():
        print(f"You may need to check index column for {sheet_name}")
    index_column = index_column.loc[~long_row_names]
    sheet_data = sheet_data.loc[~long_row_names]

    proposed_index = pd.to_datetime(index_column)

    # get the correct period index
    short_name = sheet_name.split(HYPHEN, 1)[0]
    series_id = sheet_data.columns[0]
    freq = (
        abs_meta[abs_meta[metacol.table] == short_name]
        .at[series_id, metacol.freq]
        .upper()
        .strip()[0]
    )
    freq = "Y" if freq == "A" else freq  # pandas prefers yearly
    freq = "Q" if freq == "B" else freq  # treat biannual as quarterly
    if freq not in ("Y", "Q", "M", "D"):
        print(f"Check the frequency of the data in sheet: {sheet_name}")

    # create an appropriate period index
    if freq:
        if freq in ("Q", "Y"):
            month = calendar.month_abbr[proposed_index.dt.month.max()].upper()
            freq = f"{freq}-{month}"
        sheet_data.index = pd.PeriodIndex(proposed_index, freq=freq)
    else:
        raise ValueError(f"With sheet {sheet_name} could not determine PeriodIndex")

    return sheet_data


def _capture_meta(
    cat: str,
    from_dict: dict[str, DataFrame],
    index_sheet: str,
) -> DataFrame:
    """Capture the metadata from the Index sheet of an ABS Excel file.
    Returns a DataFrame specific to the current Excel file. An empty
    DataFrame is returned if the metadata could not be identified.
    Metadata for each ABS data item is organised by row."""

    # --- step 0: set up ---
    frame = from_dict[index_sheet]

    # --- step 1: check if the metadata is present in the right place ---
    # Unfortunately, the header for some of the 3401.0
    # spreadsheets starts on row 10
    starting_rows = 8, 9, 10
    required = metacol.did, metacol.id, metacol.stype, metacol.unit
    required_set = set(required)
    all_good = False
    for header_row in starting_rows:
        header_columns = frame.iloc[header_row]
        if required_set.issubset(set(header_columns)):
            all_good = True
            break

    if not all_good:
        print(f"Table has no metadata in sheet {index_sheet}.")
        return DataFrame()

    # --- step 2: capture the metadata ---
    file_meta = frame.iloc[header_row + 1 :].copy()
    file_meta.columns = pd.Index(header_columns)

    # make sure there are no rogue white spaces
    for col in required:
        file_meta[col] = file_meta[col].str.strip()

    # remove empty columns and rows
    file_meta = file_meta.dropna(how="all", axis=1).dropna(how="all", axis=0)

    # populate the metadata
    file_meta[metacol.table] = index_sheet.split(HYPHEN, 1)[0]
    tab_desc = frame.iat[4, 1].split(".", 1)[-1].strip()
    file_meta[metacol.tdesc] = tab_desc
    file_meta[metacol.cat] = cat

    # drop the last row - it should just be a copyright statement
    file_meta = file_meta.iloc[:-1]

    # set the index to the series_id
    file_meta.index = pd.Index(file_meta[metacol.id])

    return file_meta


def _group_sheets(
    abs_dict: dict[str, DataFrame],
) -> dict[str, list[str]]:
    """Group the sheets from an Excel file."""

    keys = list(abs_dict.keys())
    long_pairs = [(x.split(HYPHEN, 1)[0], x) for x in keys]

    def group(p_list: list[tuple[str, str]]) -> dict[str, list[str]]:
        groups: dict[str, list[str]] = {}
        for x, y in p_list:
            if x not in groups:
                groups[x] = []
            groups[x].append(y)
        return groups

    return group(long_pairs)


# --- initial testing ---
if __name__ == "__main__":

    def simple_test():
        """A simple test of the read_abs_cat function."""

        # ABS Catalogue ID 8731.0 has a mix of time
        # series and non-time series data. Also,
        # it has unusually structured Excel files. So, a good test.

        print("Starting test.")

        d, _m = read_abs_cat("8731.0", keep_non_ts=False, verbose=False)
        print(f"--- {len(d)=} ---")
        print(f"--- {d.keys()=} ---")
        for table in d.keys():
            print(f"{table=} {d[table].shape=} {d[table].index.freqstr=}")

        print("Test complete.")

    simple_test()
@cache
def read_abs_cat(cat: str, keep_non_ts: bool = False, **kwargs: Any) -> tuple[dict[str, DataFrame], DataFrame]:

This function returns the complete ABS Catalogue information as a python dictionary of pandas DataFrames, as well as the associated metadata in a separate DataFrame. The function automates the collection of zip and excel files from the ABS website. If necessary, these files are downloaded, and saved into a cache directory. The files are then parsed to extract time series data, and the associated metadata.

By default, the cache directory is ./.readabs_cache/. You can change the default directory name by setting the shell environment variable READABS_CACHE_DIR with the name of the preferred directory.
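For instance, a minimal sketch of redirecting the cache from Python; the directory name here is illustrative, and setting the variable before any downloads occur is the safest assumption:

```python
import os

# use a custom cache directory (illustrative name); set this before
# the first download so readabs picks it up
os.environ["READABS_CACHE_DIR"] = "./my_abs_cache"

import readabs as ra
```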

Parameters

cat : str
    The ABS Catalogue Number for the data to be downloaded and made available by this function. This argument must be specified in the function call.

keep_non_ts : bool = False
    A flag for whether to keep the non-time-series tables that might form part of an ABS catalogue item. Normally, the non-time-series information is ignored, and not made available to the user.
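A sketch of retaining the non-time-series sheets, using the mixed catalogue item that the module's own test routine exercises:

```python
import readabs as ra

# 8731.0 mixes time-series and non-time-series tables;
# keep_non_ts=True also returns the raw non-time-series sheets
abs_dict, meta = ra.read_abs_cat("8731.0", keep_non_ts=True)
```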

**kwargs : Any
    The following parameters may be passed as optional keyword arguments.

history : str = ""
    Provide a month-year string to extract historical ABS data. For example, you can set history="dec-2023" to get the ABS data for a catalogue identifier that was originally published in respect of Q4 of 2023. Note: not all ABS data sources are structured so that this technique works in every case; but most are.
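For example, a sketch of requesting an earlier vintage of the labour force survey, assuming that catalogue item supports the history mechanism:

```python
import readabs as ra

# fetch the vintage of 6202.0 first published for Q4 of 2023
abs_dict, meta = ra.read_abs_cat(cat="6202.0", history="dec-2023")
```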

verbose : bool = False
    Setting this to True may help diagnose why something might be going wrong with the data retrieval process.

ignore_errors : bool = False
    Normally, this function will cease downloading when an error is encountered. However, sometimes the ABS website has malformed links, and changing this setting may be necessary. (Note: if you drop a message to the ABS, they will usually fix broken links within a business day.)

get_zip : bool = True
    Download the Excel files bundled in .zip files.

get_excel_if_no_zip : bool = True
    Only try to download .xlsx files if there are no zip files available to be downloaded. Only downloading individual Excel files when there are no zip files to download can speed up the download process.

get_excel : bool = False
    The default value means that Excel files are not automatically downloaded. Note: at least one of get_zip, get_excel_if_no_zip, or get_excel must be true. For most ABS catalogue items, it is sufficient to just download the one zip file. But note, some catalogue items do not have a zip file. Others have quite a number of zip files.
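As a sketch, the three flags combine like this; whether it is actually faster depends on how the particular catalogue item is published:

```python
import readabs as ra

# skip the zip bundles and fetch the individual .xlsx files instead
abs_dict, meta = ra.read_abs_cat("6202.0", get_zip=False, get_excel=True)
```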

single_excel_only : str = ""
    If this argument is set to a table name (without the .xlsx extension), only that Excel file will be downloaded. If set, and only a limited subset of available data is needed, this can speed up download times significantly. Note: overrides get_zip, get_excel_if_no_zip, get_excel and single_zip_only.

single_zip_only : str = ""
    If this argument is set to a zip file name (without the .zip extension), only that zip file will be downloaded. If set, and only a limited subset of available data is needed, this can speed up download times significantly. Note: overrides get_zip, get_excel_if_no_zip, and get_excel.
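A sketch of fetching just one table; the table name "6202001" is an assumption for illustration, so check the catalogue item's actual table names on the ABS website first:

```python
import readabs as ra

# download a single table only ("6202001" is an assumed name)
abs_dict, meta = ra.read_abs_cat("6202.0", single_excel_only="6202001")
```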

cache_only : bool = False
    If set to True, this function will only access data that has been previously cached. Normally, the function checks the date of the cached data against the date of the data on the ABS website, before deciding whether the ABS has fresher data that needs to be downloaded to the cache.
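For offline work, a minimal sketch, assuming the catalogue item has been downloaded at least once before:

```python
import readabs as ra

# read only from the local cache; no checks against the ABS website
abs_dict, meta = ra.read_abs_cat("6202.0", cache_only=True)
```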

Returns

tuple[dict[str, DataFrame], DataFrame]
    The function returns a tuple of two items. The first item is a python dictionary of pandas DataFrames (which is the primary data associated with the ABS catalogue item). The second item is a DataFrame of ABS metadata for the ABS collection.
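A quick way to see what came back; per the module source, the tables are keyed by table name and the metadata rows are indexed by ABS series identifier:

```python
import readabs as ra

abs_dict, meta = ra.read_abs_cat("6202.0")
print(list(abs_dict.keys())[:3])  # tables, keyed by table name
print(meta.index[:3])             # metadata, indexed by ABS series ID
```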

Example

```python
import readabs as ra
from pandas import DataFrame
cat_num = "6202.0"  # The ABS labour force survey
data: tuple[dict[str, DataFrame], DataFrame] = ra.read_abs_cat(cat=cat_num)
abs_dict, meta = data
```
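Mirroring the module's own test routine, each returned table can then be checked for shape and frequency:

```python
for table, frame in abs_dict.items():
    print(f"{table}: {frame.shape} {frame.index.freqstr}")
```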