#!/usr/bin/env python

# noinspection HttpUrlsUsage
"""
camcops_server/cc_modules/cc_export.py

===============================================================================

    Copyright (C) 2012-2020 Rudolf Cardinal (rudolf@pobox.com).

    This file is part of CamCOPS.

    CamCOPS is free software: you can redistribute it and/or modify
    it under the terms of the GNU General Public License as published by
    the Free Software Foundation, either version 3 of the License, or
    (at your option) any later version.

    CamCOPS is distributed in the hope that it will be useful,
    but WITHOUT ANY WARRANTY; without even the implied warranty of
    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
    GNU General Public License for more details.

    You should have received a copy of the GNU General Public License
    along with CamCOPS. If not, see <https://www.gnu.org/licenses/>.

===============================================================================

.. _ActiveMQ: https://activemq.apache.org/
.. _AMQP: https://www.amqp.org/
.. _APScheduler: https://apscheduler.readthedocs.io/
.. _Celery: https://www.celeryproject.org/
.. _Dramatiq: https://dramatiq.io/
.. _RabbitMQ: https://www.rabbitmq.com/
.. _Redis: https://redis.io/
.. _ZeroMQ: https://zeromq.org/

**Export and research dump functions.**

Export design:

*WHICH RECORDS TO SEND?*

The most powerful mechanism is not to have a sending queue (which would then
require careful multi-instance locking), but to have a "sent" log. That way
(see the sketch after this list):

- A record needs sending if it's not in the sent log (for an appropriate
  recipient).
- You can add a new recipient and the system will know about the (new)
  backlog automatically.
- You can specify criteria, e.g. don't upload records before 1/1/2014, and
  modify that later, and it would catch up with the backlog.
- Successes and failures are logged in the same table.
- Multiple recipients are handled with ease.
- No need to alter database.pl code that receives from tablets.
- Can run with a simple cron job.
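
A "sent" log makes the "needs sending" test a simple query. A minimal sketch
(the wrapper function is hypothetical, but the check it delegates to is the
real one, used by :func:`export_task` below):

.. code-block:: python

    def needs_sending(dbsession, recipient_name: str,
                      basetable: str, task_pk: int) -> bool:
        # Send if and only if the sent log has no record of an export of
        # this (recipient, base table, server PK) combination.
        return not ExportedTask.task_already_exported(
            dbsession=dbsession,
            recipient_name=recipient_name,
            basetable=basetable,
            task_pk=task_pk,
        )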

*LOCKING*

- Don't use database locking:
  https://blog.engineyard.com/2011/5-subtle-ways-youre-using-mysql-as-a-queue-and-why-itll-bite-you
- Locking via UNIX lockfiles (see the example after this list):

  - https://pypi.python.org/pypi/lockfile
  - http://pythonhosted.org/lockfile/ (which also works on Windows)

  - On UNIX, ``lockfile`` uses ``LinkLockFile``:
    https://github.com/smontanaro/pylockfile/blob/master/lockfile/linklockfile.py
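
For example, non-blocking acquisition with ``lockfile`` (a minimal sketch:
``do_export()`` and the lock filename are hypothetical, but this mirrors the
pattern used by :func:`export_task` below):

.. code-block:: python

    import lockfile

    try:
        with lockfile.FileLock("/var/lock/camcops_export_foo", timeout=0):
            do_export()  # the work, done only while we hold the lock
    except lockfile.AlreadyLocked:
        pass  # another process holds the lock; skip rather than wait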

*MESSAGE QUEUE AND BACKEND*

Thoughts as of 2018-12-22.

- See https://www.fullstackpython.com/task-queues.html. Also http://queues.io/;
  https://stackoverflow.com/questions/731233/activemq-or-rabbitmq-or-zeromq-or.

- The "default" is Celery_, with ``celery beat`` for scheduling, via an
  AMQP_ broker like RabbitMQ_.

  - Downside: no longer supported under Windows as of Celery 4.

  - There are immediate bugs when running the demo code with Celery 4.2.1,
    fixed by setting the environment variable ``set
    FORKED_BY_MULTIPROCESSING=1`` before running the worker; see
    https://github.com/celery/celery/issues/4178 and
    https://github.com/celery/celery/pull/4078.

  - Downside: backend is complex; e.g. Erlang dependency of RabbitMQ.

  - Celery also supports Redis_, but Redis_ doesn't support Windows directly
    (except the Windows Subsystem for Linux in Windows 10+).

- Another possibility is Dramatiq_ with APScheduler_.

  - Of note, APScheduler_ can use an SQLAlchemy database table as its job
    store, which might be good.
  - Dramatiq_ uses RabbitMQ_ or Redis_.
  - Dramatiq_ 1.4.0 (2018-11-25) installs cleanly under Windows. Use ``pip
    install --upgrade "dramatiq[rabbitmq, watch]"`` (i.e. with double quotes,
    not the single quotes it suggests, which don't work under Windows).
  - However, the basic example (https://dramatiq.io/guide.html) fails under
    Windows; when you fire up ``dramatiq count_words`` (even with ``--processes
    1 --threads 1``) it crashes with an error from ``ForkingPickler`` in
    ``multiprocessing.reduction``, i.e.
    https://docs.python.org/3/library/multiprocessing.html#windows. It also
    emits a ``PermissionError: [WinError 5] Access is denied``. This is
    discussed a bit at https://github.com/Bogdanp/dramatiq/issues/75;
    https://github.com/Bogdanp/dramatiq/blob/master/docs/source/changelog.rst.
    The changelog suggests 1.4.0 should work, but it doesn't.

- Worth some thought about ZeroMQ_, which is a very different sort of thing.
  Very cross-platform. Needs work to guard against message loss (i.e. messages
  are unreliable by default). Dynamic "special socket" style.

- Possibly also ActiveMQ_.

- OK; so speed is not critical, but we want message reliability, Windows
  compatibility, and decent Python bindings with job scheduling.

  - OUT: Redis (not Windows easily), ZeroMQ (fast but not by default reliable),
    ActiveMQ (few Python frameworks?).
  - REMAINING for message handling: RabbitMQ.
  - Python options therefore: Celery (but Windows not officially supported from
    4+); Dramatiq (but Windows also not very well supported and seems a bit
    bleeding-edge).

- This is looking like a mess from the Windows perspective.

- An alternative is just to use the database, of course.

  - https://softwareengineering.stackexchange.com/questions/351449/message-queue-database-vs-dedicated-mq
  - http://mikehadlow.blogspot.com/2012/04/database-as-queue-anti-pattern.html
  - https://blog.jooq.org/2014/09/26/using-your-rdbms-for-messaging-is-totally-ok/
  - https://stackoverflow.com/questions/13005410/why-do-we-need-message-brokers-like-rabbitmq-over-a-database-like-postgresql
  - https://www.quora.com/What-is-the-best-practice-using-db-tables-or-message-queues-for-moderation-of-content-approved-by-humans

- Let's take a step back and summarize the problem (a code sketch follows
  this list):

  - Many web threads may upload tasks. This should trigger a prompt export for
    all push recipients.
  - Whichever way we schedule a backend task job, it should be as the
    combination of recipient, basetable, task PK. (That way, if one recipient
    fails, the others can proceed independently.)
  - Every job should check that it's not been completed already (in case of
    accidental job restarts), i.e. is idempotent as far as we can make it.
  - How should this interact with the non-push recipients?
  - We should use the same locking method for push and non-push recipients.
  - We should make the locking granular and use file locks -- for example, for
    each task/recipient combination (or each whole-database export for a given
    recipient).
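
In sketch form (all helper names here are hypothetical; :func:`export_task`
below is the real implementation):

.. code-block:: python

    def export_one_task_job(recipient_name: str, basetable: str,
                            task_pk: int) -> None:
        # One backend job per (recipient, base table, task PK) combination,
        # so one recipient's failure doesn't block the others.
        with file_lock_for(recipient_name, basetable, task_pk):  # granular
            if already_in_sent_log(recipient_name, basetable, task_pk):
                return  # idempotent: a duplicated/restarted job is a no-op
            send_to_recipient(recipient_name, basetable, task_pk)
            add_to_sent_log(recipient_name, basetable, task_pk)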
151""" # noqa
153import logging
154import os
155import sqlite3
156import tempfile
157from typing import (Dict, List, Generator, Optional,
158 Tuple, Type, TYPE_CHECKING, Union)
160from cardinal_pythonlib.classes import gen_all_subclasses
161from cardinal_pythonlib.datetimefunc import (
162 format_datetime,
163 get_now_localtz_pendulum,
164 get_tz_local,
165 get_tz_utc,
166)
167from cardinal_pythonlib.email.sendmail import CONTENT_TYPE_TEXT
168from cardinal_pythonlib.fileops import relative_filename_within_dir
169from cardinal_pythonlib.json.serialize import register_for_json
170from cardinal_pythonlib.logs import BraceStyleAdapter
171from cardinal_pythonlib.pyramid.responses import (
172 OdsResponse,
173 SqliteBinaryResponse,
174 TextAttachmentResponse,
175 XlsxResponse,
176 ZipResponse,
177)
178from cardinal_pythonlib.sizeformatter import bytes2human
179from cardinal_pythonlib.sqlalchemy.session import get_safe_url_from_engine
180import lockfile
181from pendulum import DateTime as Pendulum, Duration, Period
182from pyramid.httpexceptions import HTTPBadRequest
183from pyramid.renderers import render_to_response
184from pyramid.response import Response
185from sqlalchemy.engine import create_engine
186from sqlalchemy.engine.result import ResultProxy
187from sqlalchemy.orm import Session as SqlASession, sessionmaker
188from sqlalchemy.sql.expression import text
189from sqlalchemy.sql.schema import Column, MetaData, Table
190from sqlalchemy.sql.sqltypes import Text
192from camcops_server.cc_modules.cc_audit import audit
193from camcops_server.cc_modules.cc_constants import DateFormat
194from camcops_server.cc_modules.cc_dump import copy_tasks_and_summaries
195from camcops_server.cc_modules.cc_email import Email
196from camcops_server.cc_modules.cc_exportmodels import (
197 ExportedTask,
198 ExportRecipient,
199 gen_tasks_having_exportedtasks,
200 get_collection_for_export,
201)
202from camcops_server.cc_modules.cc_forms import UserDownloadDeleteForm
203from camcops_server.cc_modules.cc_pyramid import Routes, ViewArg, ViewParam
204from camcops_server.cc_modules.cc_simpleobjects import TaskExportOptions
205from camcops_server.cc_modules.cc_sqlalchemy import sql_from_sqlite_database
206from camcops_server.cc_modules.cc_task import Task
207from camcops_server.cc_modules.cc_tsv import TsvCollection, TsvPage
208from camcops_server.cc_modules.celery import (
209 create_user_download,
210 email_basic_dump,
211 export_task_backend,
212)
214if TYPE_CHECKING:
215 from camcops_server.cc_modules.cc_request import CamcopsRequest
216 from camcops_server.cc_modules.cc_taskcollection import TaskCollection
218log = BraceStyleAdapter(logging.getLogger(__name__))


# =============================================================================
# Constants
# =============================================================================

INFOSCHEMA_PAGENAME = "_camcops_information_schema_columns"


# =============================================================================
# Export tasks from the back end
# =============================================================================

def print_export_queue(req: "CamcopsRequest",
                       recipient_names: List[str] = None,
                       all_recipients: bool = False,
                       via_index: bool = True,
                       pretty: bool = False) -> None:
    """
    Called from the command line.

    Shows tasks that would be exported.

    Args:
        req: a :class:`camcops_server.cc_modules.cc_request.CamcopsRequest`
        recipient_names: list of export recipient names (as per the config
            file)
        all_recipients: use all recipients?
        via_index: use the task index (faster)?
        pretty: use ``str(task)`` not ``repr(task)`` (prettier, slower because
            it has to query the patient)
    """
    recipients = req.get_export_recipients(
        recipient_names=recipient_names,
        all_recipients=all_recipients,
        save=False
    )
    if not recipients:
        log.warning("No export recipients")
        return
    for recipient in recipients:
        log.info("Tasks to be exported for recipient: {}", recipient)
        collection = get_collection_for_export(req, recipient,
                                               via_index=via_index)
        for task in collection.gen_tasks_by_class():
            print(
                f"{recipient.recipient_name}: "
                f"{str(task) if pretty else repr(task)}"
            )


def export(req: "CamcopsRequest",
           recipient_names: List[str] = None,
           all_recipients: bool = False,
           via_index: bool = True,
           schedule_via_backend: bool = False) -> None:
    """
    Called from the command line.

    Exports all relevant tasks (pending incremental exports, or everything if
    applicable) for specified export recipients.

    Obtains a file lock, then iterates through all recipients.

    Args:
        req: a :class:`camcops_server.cc_modules.cc_request.CamcopsRequest`
        recipient_names: list of export recipient names (as per the config
            file)
        all_recipients: use all recipients?
        via_index: use the task index (faster)?
        schedule_via_backend: schedule jobs via the backend instead?
    """
    recipients = req.get_export_recipients(recipient_names=recipient_names,
                                           all_recipients=all_recipients)
    if not recipients:
        log.warning("No export recipients")
        return
    for recipient in recipients:
        log.info("Exporting to recipient: {}", recipient)
        if recipient.using_db():
            if schedule_via_backend:
                raise NotImplementedError()  # todo: implement whole-database export via Celery backend  # noqa
            else:
                export_whole_database(req, recipient, via_index=via_index)
        else:
            # Non-database recipient.
            export_tasks_individually(
                req, recipient,
                via_index=via_index, schedule_via_backend=schedule_via_backend)


def export_whole_database(req: "CamcopsRequest",
                          recipient: ExportRecipient,
                          via_index: bool = True) -> None:
    """
    Exports to a database.

    Holds a recipient-specific file lock in the process.

    Args:
        req: a :class:`camcops_server.cc_modules.cc_request.CamcopsRequest`
        recipient: an :class:`camcops_server.cc_modules.cc_exportmodels.ExportRecipient`
        via_index: use the task index (faster)?
    """  # noqa
    cfg = req.config
    lockfilename = cfg.get_export_lockfilename_db(
        recipient_name=recipient.recipient_name)
    try:
        with lockfile.FileLock(lockfilename, timeout=0):  # doesn't wait
            collection = get_collection_for_export(req, recipient,
                                                   via_index=via_index)
            dst_engine = create_engine(recipient.db_url,
                                       echo=recipient.db_echo)
            log.info("Exporting to database: {}",
                     get_safe_url_from_engine(dst_engine))
            dst_session = sessionmaker(bind=dst_engine)()  # type: SqlASession
            task_generator = gen_tasks_having_exportedtasks(collection)
            export_options = TaskExportOptions(
                include_blobs=recipient.db_include_blobs,
                db_patient_id_per_row=recipient.db_patient_id_per_row,
                db_make_all_tables_even_empty=True,
                db_include_summaries=recipient.db_add_summaries,
            )
            copy_tasks_and_summaries(
                tasks=task_generator,
                dst_engine=dst_engine,
                dst_session=dst_session,
                export_options=export_options,
                req=req,
            )
            dst_session.commit()
    except lockfile.AlreadyLocked:
        log.warning("Export lockfile {!r} already locked by another process; "
                    "aborting", lockfilename)


def export_tasks_individually(req: "CamcopsRequest",
                              recipient: ExportRecipient,
                              via_index: bool = True,
                              schedule_via_backend: bool = False) -> None:
    """
    Exports all necessary tasks for a recipient.

    Args:
        req: a :class:`camcops_server.cc_modules.cc_request.CamcopsRequest`
        recipient: an :class:`camcops_server.cc_modules.cc_exportmodels.ExportRecipient`
        via_index: use the task index (faster)?
        schedule_via_backend: schedule jobs via the backend instead?
    """  # noqa
    collection = get_collection_for_export(req, recipient, via_index=via_index)
    if schedule_via_backend:
        recipient_name = recipient.recipient_name
        for task_or_index in collection.gen_all_tasks_or_indexes():
            if isinstance(task_or_index, Task):
                basetable = task_or_index.tablename
                task_pk = task_or_index.pk
            else:
                basetable = task_or_index.task_table_name
                task_pk = task_or_index.task_pk
            log.info("Submitting background job to export task {}.{} to {}",
                     basetable, task_pk, recipient_name)
            export_task_backend.delay(
                recipient_name=recipient_name,
                basetable=basetable,
                task_pk=task_pk
            )
    else:
        for task in collection.gen_tasks_by_class():
            # Do NOT use this to check the working of export_task_backend():
            # export_task_backend(recipient.recipient_name, task.tablename, task.pk)  # noqa
            # ... it will deadlock at the database (because we're already
            # within a query of some sort, I presume)
            export_task(req, recipient, task)


def export_task(req: "CamcopsRequest",
                recipient: ExportRecipient,
                task: Task) -> None:
    """
    Exports a single task, checking that it remains valid to do so.

    Args:
        req: a :class:`camcops_server.cc_modules.cc_request.CamcopsRequest`
        recipient: an :class:`camcops_server.cc_modules.cc_exportmodels.ExportRecipient`
        task: a :class:`camcops_server.cc_modules.cc_task.Task`
    """  # noqa

    # Double-check it's OK! Just in case, for example, an old backend task has
    # persisted, or someone's managed to get an iffy back-end request in some
    # other way.
    if not recipient.is_task_suitable(task):
        # Warning will already have been emitted (by is_task_suitable).
        return

    cfg = req.config
    lockfilename = cfg.get_export_lockfilename_task(
        recipient_name=recipient.recipient_name,
        basetable=task.tablename,
        pk=task.pk,
    )
    dbsession = req.dbsession
    try:
        with lockfile.FileLock(lockfilename, timeout=0):  # doesn't wait
            # We recheck the export status once we hold the lock, in case
            # multiple jobs are competing to export it.
            if ExportedTask.task_already_exported(
                    dbsession=dbsession,
                    recipient_name=recipient.recipient_name,
                    basetable=task.tablename,
                    task_pk=task.pk):
                log.info("Task {!r} already exported to recipient {!r}; "
                         "ignoring", task, recipient)
                # Not a warning; it's normal to see these because it allows the
                # client API to skip some checks for speed.
                return
            # OK; safe to export now.
            et = ExportedTask(recipient, task)
            dbsession.add(et)
            et.export(req)
            dbsession.commit()  # so the ExportedTask is visible to others ASAP
    except lockfile.AlreadyLocked:
        log.warning("Export lockfile {!r} already locked by another process; "
                    "aborting", lockfilename)


# =============================================================================
# Helpers for task collection export functions
# =============================================================================

def gen_audited_tasks_for_task_class(
        collection: "TaskCollection",
        cls: Type[Task],
        audit_descriptions: List[str]) -> Generator[Task, None, None]:
    """
    Generates tasks from a collection, for a given task class, simultaneously
    adding to an audit description. Used for user-triggered downloads.

    Args:
        collection: a :class:`camcops_server.cc_modules.cc_taskcollection.TaskCollection`
        cls: the task class to generate
        audit_descriptions: list of strings to be modified

    Yields:
        :class:`camcops_server.cc_modules.cc_task.Task` objects
    """  # noqa
    pklist = []  # type: List[int]
    for task in collection.tasks_for_task_class(cls):
        pklist.append(task.pk)
        yield task
    audit_descriptions.append(
        f"{cls.__tablename__}: "
        f"{','.join(str(pk) for pk in pklist)}"
    )


def gen_audited_tasks_by_task_class(
        collection: "TaskCollection",
        audit_descriptions: List[str]) -> Generator[Task, None, None]:
    """
    Generates tasks from a collection, across task classes, simultaneously
    adding to an audit description. Used for user-triggered downloads.

    Args:
        collection: a :class:`camcops_server.cc_modules.cc_taskcollection.TaskCollection`
        audit_descriptions: list of strings to be modified

    Yields:
        :class:`camcops_server.cc_modules.cc_task.Task` objects
    """  # noqa
    for cls in collection.task_classes():
        for task in gen_audited_tasks_for_task_class(collection, cls,
                                                     audit_descriptions):
            yield task


def get_information_schema_query(req: "CamcopsRequest") -> ResultProxy:
    """
    Returns an SQLAlchemy query object that fetches the
    INFORMATION_SCHEMA.COLUMNS information from our source database.

    This is not sensitive; there is no data, just structure/comments.
    """
    # Find our database name
    # https://stackoverflow.com/questions/53554458/sqlalchemy-get-database-name-from-engine
    dbname = req.engine.url.database
    # Query the information schema for our database.
    # https://docs.sqlalchemy.org/en/13/core/sqlelement.html#sqlalchemy.sql.expression.text  # noqa
    query = text("""
        SELECT *
        FROM information_schema.columns
        WHERE table_schema = :dbname
    """).bindparams(dbname=dbname)
    result_proxy = req.dbsession.execute(query)
    return result_proxy


def get_information_schema_tsv_page(
        req: "CamcopsRequest",
        page_name: str = INFOSCHEMA_PAGENAME) -> TsvPage:
    """
    Returns the server database's ``INFORMATION_SCHEMA.COLUMNS`` table as a
    :class:`camcops_server.cc_modules.cc_tsv.TsvPage`.
    """
    result_proxy = get_information_schema_query(req)
    return TsvPage.from_resultproxy(page_name, result_proxy)


def write_information_schema_to_dst(
        req: "CamcopsRequest",
        dst_session: SqlASession,
        dest_table_name: str = INFOSCHEMA_PAGENAME) -> None:
    """
    Writes the server's information schema to a separate database session
    (which will be an SQLite database being created for download).

    There must be no open transactions (i.e. please COMMIT before you call
    this function), since we need to create a table.
    """
    # 1. Read the structure of INFORMATION_SCHEMA.COLUMNS itself.
    # https://stackoverflow.com/questions/21770829/sqlalchemy-copy-schema-and-data-of-subquery-to-another-database  # noqa
    src_engine = req.engine
    dst_engine = dst_session.bind
    metadata = MetaData(bind=dst_engine)
    table = Table(
        "columns",  # table name; see also "schema" argument
        metadata,  # "load with the destination metadata"
        # Override some specific column types by hand, or they'll fail as
        # SQLAlchemy fails to reflect the MySQL LONGTEXT type properly:
        Column("COLUMN_DEFAULT", Text),
        Column("COLUMN_TYPE", Text),
        Column("GENERATION_EXPRESSION", Text),
        autoload=True,  # "read (reflect) structure from the database"
        autoload_with=src_engine,  # "read (reflect) structure from the source"
        schema="information_schema"  # schema
    )
    # 2. Write that structure to our new database.
    table.name = dest_table_name  # create it with a different name
    table.schema = ""  # we don't have a schema in the destination database
    table.create(dst_engine)  # CREATE TABLE
    # 3. Fetch data.
    query = get_information_schema_query(req)
    # 4. Write the data.
    for row in query:
        dst_session.execute(table.insert(row))
    # 5. COMMIT
    dst_session.commit()


# =============================================================================
# Convert task collections to different export formats for user download
# =============================================================================

@register_for_json
class DownloadOptions(object):
    """
    Represents options for the process of the user downloading tasks.
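
    Example (a minimal usage sketch: an immediate XLSX download, with a
    hypothetical user ID):

    .. code-block:: python

        options = DownloadOptions(
            user_id=1,  # hypothetical
            viewtype=ViewArg.XLSX,
            delivery_mode=ViewArg.IMMEDIATELY,
        )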
576 """
577 DELIVERY_MODES = [
578 ViewArg.DOWNLOAD,
579 ViewArg.EMAIL,
580 ViewArg.IMMEDIATELY,
581 ]
583 def __init__(self,
584 user_id: int,
585 viewtype: str,
586 delivery_mode: str,
587 spreadsheet_sort_by_heading: bool = False,
588 db_include_blobs: bool = False,
589 db_patient_id_per_row: bool = False,
590 include_information_schema_columns: bool = True) -> None:
591 """
592 Args:
593 user_id:
594 ID of the user creating the request (may be needed to pass to
595 the back-end)
596 viewtype:
597 file format for receiving data (e.g. XLSX, SQLite)
598 delivery_mode:
599 method of delivery (e.g. immediate, e-mail)
600 spreadsheet_sort_by_heading:
601 (For spreadsheets.)
602 Sort columns within each page by heading name?
603 db_include_blobs:
604 (For database downloads.)
605 Include BLOBs?
606 db_patient_id_per_row:
607 (For database downloads.)
608 Denormalize by include the patient ID in all rows of
609 patient-related tables?
610 include_information_schema_columns:
611 Include descriptions of the columns provided?
612 """
613 assert delivery_mode in self.DELIVERY_MODES
614 self.user_id = user_id
615 self.viewtype = viewtype
616 self.delivery_mode = delivery_mode
617 self.spreadsheet_sort_by_heading = spreadsheet_sort_by_heading
618 self.db_include_blobs = db_include_blobs
619 self.db_patient_id_per_row = db_patient_id_per_row
620 self.include_information_schema_columns = include_information_schema_columns # noqa


class TaskCollectionExporter(object):
    """
    Class to provide tasks for user download.
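
    To add a new download format, subclass this and provide ``viewtype``,
    ``file_extension``, :meth:`get_file_body` and :meth:`get_data_response`
    (compare the concrete exporters below). A minimal sketch, for a
    hypothetical format:

    .. code-block:: python

        class JsonExporter(TaskCollectionExporter):
            file_extension = "json"
            viewtype = "json"  # real subclasses use ViewArg constants

            def get_file_body(self) -> bytes:
                return b"{}"  # placeholder content only

            def get_data_response(self, body: bytes,
                                  filename: str) -> Response:
                return TextAttachmentResponse(body=body.decode("utf-8"),
                                              filename=filename)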
626 """
628 def __init__(self,
629 req: "CamcopsRequest",
630 collection: "TaskCollection",
631 options: DownloadOptions):
632 """
633 Args:
634 req:
635 a :class:`camcops_server.cc_modules.cc_request.CamcopsRequest`
636 collection:
637 a :class:`camcops_server.cc_modules.cc_taskcollection.TaskCollection`
638 options:
639 :class:`DownloadOptions` governing the download
640 """ # noqa
641 self.req = req
642 self.collection = collection
643 self.options = options
645 @property
646 def viewtype(self) -> str:
647 raise NotImplementedError("Exporter needs to implement 'viewtype'")
649 @property
650 def file_extension(self) -> str:
651 raise NotImplementedError(
652 "Exporter needs to implement 'file_extension'"
653 )
655 def get_filename(self) -> str:
656 """
657 Returns the filename for the download.
658 """
659 timestamp = format_datetime(self.req.now, DateFormat.FILENAME)
660 return f"CamCOPS_dump_{timestamp}.{self.file_extension}"

    def immediate_response(self, req: "CamcopsRequest") -> Response:
        """
        Returns either a :class:`Response` with the data, or a
        :class:`Response` saying how the user will obtain their data later.

        Args:
            req: a :class:`camcops_server.cc_modules.cc_request.CamcopsRequest`
        """
        if self.options.delivery_mode == ViewArg.EMAIL:
            self.schedule_email()
            return render_to_response(
                "email_scheduled.mako",
                dict(),
                request=req
            )
        elif self.options.delivery_mode == ViewArg.DOWNLOAD:
            self.schedule_download()
            return render_to_response(
                "download_scheduled.mako",
                dict(),
                request=req
            )
        else:  # ViewArg.IMMEDIATELY
            return self.download_now()

    def download_now(self) -> Response:
        """
        Downloads the data dump in the selected format.
        """
        filename, body = self.to_file()
        return self.get_data_response(body=body, filename=filename)

    def schedule_email(self) -> None:
        """
        Schedules the export asynchronously, e-mailing the logged-in user
        when done.
        """
        email_basic_dump.delay(self.collection, self.options)

    def send_by_email(self) -> None:
        """
        Sends the data dump by e-mail to the logged-in user.
        """
        _ = self.req.gettext
        config = self.req.config

        filename, body = self.to_file()
        email_to = self.req.user.email
        email = Email(
            # date: automatic
            from_addr=config.email_from,
            to=email_to,
            subject=_("CamCOPS research data dump"),
            body=_("The research data dump you requested is attached."),
            content_type=CONTENT_TYPE_TEXT,
            charset="utf8",
            attachments_binary=[(filename, body)],
        )
        email.send(
            host=config.email_host,
            username=config.email_host_username,
            password=config.email_host_password,
            port=config.email_port,
            use_tls=config.email_use_tls,
        )

        if email.sent:
            log.info(f"Research dump emailed to {email_to}")
        else:
            log.error(f"Failed to email research dump to {email_to}")

    def schedule_download(self) -> None:
        """
        Schedules a background export to a file that the user can download
        later.
        """
        create_user_download.delay(self.collection, self.options)

    def create_user_download_and_email(self) -> None:
        """
        Creates a user download, and e-mails the user to let them know.
        """
        _ = self.req.gettext
        config = self.req.config

        download_dir = self.req.user_download_dir
        space = self.req.user_download_bytes_available
        filename, contents = self.to_file()
        size = len(contents)

        if size > space:
            # Not enough space
            total_permitted = self.req.user_download_bytes_permitted
            msg = _(
                "You do not have enough space to create this download. "
                "You are allowed {total_permitted} bytes and you have "
                "{space} bytes free. This download would need {size} bytes."
            ).format(total_permitted=total_permitted, space=space, size=size)
        else:
            # Create file
            fullpath = os.path.join(download_dir, filename)
            try:
                with open(fullpath, "wb") as f:
                    f.write(contents)
                # Success
                log.info(f"Created user download: {fullpath}")
                msg = _(
                    "The research data dump you requested is ready to be "
                    "downloaded. You will find it in your download area. "
                    "It is called %s"
                ) % filename
            except Exception as e:
                # File creation failed
                msg = _(
                    "Failed to create file {filename}. Error was: {message}"
                ).format(filename=filename, message=e)

        # E-mail the user, if they have an e-mail address
        email_to = self.req.user.email
        if email_to:
            email = Email(
                # date: automatic
                from_addr=config.email_from,
                to=email_to,
                subject=_("CamCOPS research data dump"),
                body=msg,
                content_type=CONTENT_TYPE_TEXT,
                charset="utf8",
            )
            email.send(
                host=config.email_host,
                username=config.email_host_username,
                password=config.email_host_password,
                port=config.email_port,
                use_tls=config.email_use_tls,
            )

    def get_data_response(self, body: bytes, filename: str) -> Response:
        raise NotImplementedError(
            "Exporter needs to implement 'get_data_response'"
        )

    def to_file(self) -> Tuple[str, bytes]:
        """
        Returns the tuple ``filename, file_contents``.
        """
        return self.get_filename(), self.get_file_body()

    def get_file_body(self) -> bytes:
        """
        Returns binary data to be stored as a file.
        """
        raise NotImplementedError(
            "Exporter needs to implement 'get_file_body'"
        )

    def get_tsv_collection(self) -> TsvCollection:
        """
        Converts the collection of tasks to a collection of spreadsheet-style
        data. Also audits the request as a basic data dump.

        Returns:
            a :class:`camcops_server.cc_modules.cc_tsv.TsvCollection` object
        """  # noqa
        audit_descriptions = []  # type: List[str]
        # Task may return >1 file for TSV output (e.g. for subtables).
        tsvcoll = TsvCollection()
        # Iterate through tasks, creating the TSV collection
        for cls in self.collection.task_classes():
            for task in gen_audited_tasks_for_task_class(self.collection, cls,
                                                         audit_descriptions):
                tsv_pages = task.get_tsv_pages(self.req)
                tsvcoll.add_pages(tsv_pages)

        if self.options.include_information_schema_columns:
            info_schema_page = get_information_schema_tsv_page(self.req)
            tsvcoll.add_page(info_schema_page)

        tsvcoll.sort_pages()
        if self.options.spreadsheet_sort_by_heading:
            tsvcoll.sort_headings_within_all_pages()

        audit(self.req, f"Basic dump: {'; '.join(audit_descriptions)}")

        return tsvcoll


class OdsExporter(TaskCollectionExporter):
    """
    Converts a set of tasks to an OpenOffice ODS file.
    """
    file_extension = "ods"
    viewtype = ViewArg.ODS

    def get_file_body(self) -> bytes:
        return self.get_tsv_collection().as_ods()

    def get_data_response(self, body: bytes, filename: str) -> Response:
        return OdsResponse(body=body, filename=filename)


class RExporter(TaskCollectionExporter):
    """
    Converts a set of tasks to an R script.
    """
    file_extension = "R"
    viewtype = ViewArg.R

    def __init__(self, *args, **kwargs) -> None:
        super().__init__(*args, **kwargs)
        self.encoding = "utf-8"

    def get_file_body(self) -> bytes:
        return self.get_r_script().encode(self.encoding)

    def get_r_script(self) -> str:
        return self.get_tsv_collection().as_r()

    def get_data_response(self, body: bytes, filename: str) -> Response:
        # The R script is served as text, so we regenerate it as a string
        # rather than using the binary "body" argument.
        filename = self.get_filename()
        r_script = self.get_r_script()
        return TextAttachmentResponse(body=r_script, filename=filename)


class TsvZipExporter(TaskCollectionExporter):
    """
    Converts a set of tasks to a set of TSV (tab-separated value) files, one
    per table, in a ZIP file.
    """
    file_extension = "zip"
    viewtype = ViewArg.TSV_ZIP

    def get_file_body(self) -> bytes:
        return self.get_tsv_collection().as_zip()

    def get_data_response(self, body: bytes, filename: str) -> Response:
        return ZipResponse(body=body, filename=filename)


class XlsxExporter(TaskCollectionExporter):
    """
    Converts a set of tasks to an Excel XLSX file.
    """
    file_extension = "xlsx"
    viewtype = ViewArg.XLSX

    def get_file_body(self) -> bytes:
        return self.get_tsv_collection().as_xlsx()

    def get_data_response(self, body: bytes, filename: str) -> Response:
        return XlsxResponse(body=body, filename=filename)


class SqliteExporter(TaskCollectionExporter):
    """
    Converts a set of tasks to an SQLite binary file.
    """
    file_extension = "sqlite"
    viewtype = ViewArg.SQLITE

    def get_export_options(self) -> TaskExportOptions:
        return TaskExportOptions(
            include_blobs=self.options.db_include_blobs,
            db_include_summaries=True,
            db_make_all_tables_even_empty=True,  # debatable, but more consistent!  # noqa
            db_patient_id_per_row=self.options.db_patient_id_per_row,
        )

    def get_sqlite_data(self, as_text: bool) -> Union[bytes, str]:
        """
        Returns data as a binary SQLite database, or SQL text to create it.

        Args:
            as_text: textual SQL, rather than binary SQLite?

        Returns:
            ``bytes`` or ``str``, according to ``as_text``
        """
        # ---------------------------------------------------------------------
        # Create memory file, dumper, and engine
        # ---------------------------------------------------------------------

        # This approach failed:
        #
        # memfile = io.StringIO()
        #
        # def dump(querysql, *multiparams, **params):
        #     compsql = querysql.compile(dialect=engine.dialect)
        #     memfile.write("{};\n".format(compsql))
        #
        # engine = create_engine('{dialect}://'.format(dialect=dialect_name),
        #                        strategy='mock', executor=dump)
        # dst_session = sessionmaker(bind=engine)()  # type: SqlASession
        #
        # ... you get the error
        #     AttributeError: 'MockConnection' object has no attribute 'begin'
        # ... which is fair enough.
        #
        # Next best thing: SQLite database.
        # Two ways to deal with it:
        # (a) duplicate our C++ dump code (which itself duplicates the SQLite
        #     command-line executable's dump facility), then create the
        #     database, dump it to a string, serve the string; or
        # (b) offer the binary SQLite file.
        # Or... (c) both.
        # Aha! sqlite3.Connection.iterdump does this for us.
        #
        # If we create an in-memory database using create_engine('sqlite://'),
        # can we get the binary contents out? Don't think so.
        #
        # So we should first create a temporary on-disk file, then use that.

        # ---------------------------------------------------------------------
        # Make temporary file (one whose filename we can know).
        # ---------------------------------------------------------------------
        # We could use tempfile.mkstemp() for security, or NamedTemporaryFile,
        # which is a bit easier. However, you can't necessarily open the file
        # again under all OSs, so that's no good. The final option is
        # TemporaryDirectory, which is secure and convenient.
        #
        # https://docs.python.org/3/library/tempfile.html
        # https://security.openstack.org/guidelines/dg_using-temporary-files-securely.html  # noqa
        # https://stackoverflow.com/questions/3924117/how-to-use-tempfile-namedtemporaryfile-in-python  # noqa
        db_basename = "temp.sqlite3"
        with tempfile.TemporaryDirectory() as tmpdirname:
            db_filename = os.path.join(tmpdirname, db_basename)
            # -----------------------------------------------------------------
            # Make SQLAlchemy session
            # -----------------------------------------------------------------
            url = "sqlite:///" + db_filename
            engine = create_engine(url, echo=False)
            dst_session = sessionmaker(bind=engine)()  # type: SqlASession
            # -----------------------------------------------------------------
            # Iterate through tasks, creating tables as we need them.
            # -----------------------------------------------------------------
            audit_descriptions = []  # type: List[str]
            task_generator = gen_audited_tasks_by_task_class(
                self.collection, audit_descriptions)
            # -----------------------------------------------------------------
            # Next bit very tricky. We're trying to achieve several things:
            # - a copy of part of the database structure
            # - a copy of part of the data, with relationships intact
            # - nothing sensitive (e.g. full User records) going through
            # - adding new columns for Task objects offering summary values
            # - Must treat tasks all together, because otherwise we will insert
            #   duplicate dependency objects like Group objects.
            # -----------------------------------------------------------------
            copy_tasks_and_summaries(tasks=task_generator,
                                     dst_engine=engine,
                                     dst_session=dst_session,
                                     export_options=self.get_export_options(),
                                     req=self.req)
            dst_session.commit()
            if self.options.include_information_schema_columns:
                # Must have committed before we do this:
                write_information_schema_to_dst(self.req, dst_session)
            # -----------------------------------------------------------------
            # Audit
            # -----------------------------------------------------------------
            audit(self.req, f"SQL dump: {'; '.join(audit_descriptions)}")
            # -----------------------------------------------------------------
            # Fetch file contents, either as binary, or as SQL
            # -----------------------------------------------------------------
            if as_text:
                # SQL text
                connection = sqlite3.connect(db_filename)  # type: sqlite3.Connection  # noqa
                sql_text = sql_from_sqlite_database(connection)
                connection.close()
                return sql_text
            else:
                # SQLite binary
                with open(db_filename, 'rb') as f:
                    binary_contents = f.read()
                return binary_contents

    def get_file_body(self) -> bytes:
        return self.get_sqlite_data(as_text=False)

    def get_data_response(self, body: bytes, filename: str) -> Response:
        return SqliteBinaryResponse(body=body, filename=filename)


class SqlExporter(SqliteExporter):
    """
    Converts a set of tasks to the textual SQL needed to create an SQLite
    file.
    """
    file_extension = "sql"
    viewtype = ViewArg.SQL

    def __init__(self, *args, **kwargs) -> None:
        super().__init__(*args, **kwargs)
        self.encoding = "utf-8"

    def get_file_body(self) -> bytes:
        return self.get_sql().encode(self.encoding)

    def get_sql(self) -> str:
        """
        Returns SQL text representing the SQLite database.
        """
        return self.get_sqlite_data(as_text=True)

    def download_now(self) -> Response:
        """
        Downloads the data dump in the selected format.
        """
        filename = self.get_filename()
        sql_text = self.get_sql()
        return TextAttachmentResponse(body=sql_text, filename=filename)

    def get_data_response(self, body: bytes, filename: str) -> Response:
        """
        Unused: this class overrides :meth:`download_now` instead.
        """
        pass


# Create mapping from "viewtype" to class.
# noinspection PyTypeChecker
DOWNLOADER_CLASSES = {}  # type: Dict[str, Type[TaskCollectionExporter]]
for _cls in gen_all_subclasses(TaskCollectionExporter):  # type: Type[TaskCollectionExporter]  # noqa
    # noinspection PyTypeChecker
    DOWNLOADER_CLASSES[_cls.viewtype] = _cls


def make_exporter(req: "CamcopsRequest",
                  collection: "TaskCollection",
                  options: DownloadOptions) -> TaskCollectionExporter:
    """
    Creates an exporter of the class appropriate to the requested view type.

    Args:
        req:
            a :class:`camcops_server.cc_modules.cc_request.CamcopsRequest`
        collection:
            a
            :class:`camcops_server.cc_modules.cc_taskcollection.TaskCollection`
        options:
            :class:`camcops_server.cc_modules.cc_export.DownloadOptions`
            governing the download

    Returns:
        a :class:`TaskCollectionExporter`

    Raises:
        :exc:`HTTPBadRequest` if the arguments are bad
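
    Example (a minimal usage sketch, assuming an existing request ``req``
    whose ``user_id`` attribute gives the current user's ID, and a
    :class:`TaskCollection` ``collection``):

    .. code-block:: python

        exporter = make_exporter(
            req=req,
            collection=collection,
            options=DownloadOptions(
                user_id=req.user_id,
                viewtype=ViewArg.XLSX,
                delivery_mode=ViewArg.IMMEDIATELY,
            ),
        )
        response = exporter.immediate_response(req)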
1105 """
1106 _ = req.gettext
1107 if options.delivery_mode not in DownloadOptions.DELIVERY_MODES:
1108 raise HTTPBadRequest(
1109 f"{_('Bad delivery mode:')} {options.delivery_mode!r} "
1110 f"({_('permissible:')} "
1111 f"{DownloadOptions.DELIVERY_MODES!r})")
1112 try:
1113 downloader_class = DOWNLOADER_CLASSES[options.viewtype]
1114 except KeyError:
1115 raise HTTPBadRequest(
1116 f"{_('Bad output type:')} {options.viewtype!r} "
1117 f"({_('permissible:')} {DOWNLOADER_CLASSES.keys()!r})")
1118 return downloader_class(
1119 req=req,
1120 collection=collection,
1121 options=options
1122 )


# =============================================================================
# Represent files for users to download
# =============================================================================

class UserDownloadFile(object):
    """
    Represents a file that has been generated for the user to download.

    Test code:

    .. code-block:: python

        from camcops_server.cc_modules.cc_export import UserDownloadFile
        x = UserDownloadFile("/etc/hosts")

        print(x.when_last_modified)  # should match output of: ls -l /etc/hosts

        many = UserDownloadFile.from_directory_scan("/etc")

    """
    def __init__(self, filename: str, directory: str = "",
                 permitted_lifespan_min: float = 0,
                 req: "CamcopsRequest" = None) -> None:
        """
        Args:
            filename: filename relative to ``directory``
            directory: directory
            permitted_lifespan_min: lifespan permitted for the file, in
                minutes, before the server may delete it
            req: a :class:`camcops_server.cc_modules.cc_request.CamcopsRequest`

        Notes:

        - The Unix ``ls`` command shows timestamps in the current timezone.
          Try ``TZ=utc ls -l <filename>`` or ``TZ="America/New_York" ls -l
          <filename>`` to see this.
        - The underlying timestamp is the time (in seconds) since the Unix
          "epoch", which is 00:00:00 UTC on 1 Jan 1970
          (https://en.wikipedia.org/wiki/Unix_time).
        """
        self.filename = filename
        self.directory = directory
        self.permitted_lifespan_min = permitted_lifespan_min
        self.req = req

        self.basename = os.path.basename(filename)
        _, self.extension = os.path.splitext(filename)
        if directory:
            self.fullpath = os.path.join(directory, filename)
        else:
            self.fullpath = filename
        try:
            self.statinfo = os.stat(self.fullpath)
            self.exists = True
        except FileNotFoundError:
            self.statinfo = None  # type: Optional[os.stat_result]
            self.exists = False

    # -------------------------------------------------------------------------
    # Size
    # -------------------------------------------------------------------------

    @property
    def size(self) -> Optional[int]:
        """
        Size of the file, in bytes. Returns ``None`` if the file does not
        exist.
        """
        return self.statinfo.st_size if self.exists else None

    @property
    def size_str(self) -> str:
        """
        Returns a pretty-format string describing the file's size.
        """
        size_bytes = self.size
        if size_bytes is None:
            return ""
        return bytes2human(size_bytes)

    # -------------------------------------------------------------------------
    # Timing
    # -------------------------------------------------------------------------

    @property
    def when_last_modified(self) -> Optional[Pendulum]:
        """
        Returns the file's modification time, or ``None`` if it doesn't exist.

        (Creation time is harder! See
        https://stackoverflow.com/questions/237079/how-to-get-file-creation-modification-date-times-in-python.)
        """  # noqa
        if not self.exists:
            return None
        # noinspection PyTypeChecker
        modified = Pendulum.fromtimestamp(self.statinfo.st_mtime,
                                          tz=get_tz_utc())  # type: Pendulum
        # ... gives the correct time in the UTC timezone
        # ... note that utcfromtimestamp() gives a time without a timezone,
        #     which is unhelpful!
        # We would like this to display in the current timezone:
        return modified.in_timezone(get_tz_local())

    @property
    def when_last_modified_str(self) -> str:
        """
        Returns a formatted string with the file's modification time.
        """
        w = self.when_last_modified
        if not w:
            return ""
        return format_datetime(w, DateFormat.ISO8601_HUMANIZED_TO_SECONDS)

    @property
    def time_left(self) -> Optional[Duration]:
        """
        Returns the amount of time that this file has left to live before
        the server will delete it. Returns ``None`` if the file does not exist.
        """
        if not self.exists:
            return None
        now = get_now_localtz_pendulum()
        death = (
            self.when_last_modified +
            Duration(minutes=self.permitted_lifespan_min)
        )
        remaining = death - now  # type: Period
        # Note that Period is a subclass of Duration, but its __str__()
        # method is different. Duration maps __str__() to in_words(), but
        # Period maps __str__() to __repr__().
        return remaining

    @property
    def time_left_str(self) -> str:
        """
        A string version of :meth:`time_left`.
        """
        t = self.time_left
        if not t:
            return ""
        return t.in_words()  # Duration and Period do nice formatting

    def older_than(self, when: Pendulum) -> bool:
        """
        Was the file last modified before the specified time?
        """
        m = self.when_last_modified
        if not m:
            return False
        return m < when

    # -------------------------------------------------------------------------
    # Deletion
    # -------------------------------------------------------------------------

    @property
    def delete_form(self) -> str:
        """
        Returns HTML for a form to delete this file.
        """
        if not self.req:
            return ""
        dest_url = self.req.route_url(Routes.DELETE_FILE)
        form = UserDownloadDeleteForm(
            request=self.req,
            action=dest_url
        )
        appstruct = {ViewParam.FILENAME: self.filename}
        rendered_form = form.render(appstruct)
        return rendered_form

    def delete(self) -> None:
        """
        Deletes the file. Does not raise an exception if the file does not
        exist.
        """
        try:
            os.remove(self.fullpath)
            log.info(f"Deleted file: {self.fullpath}")
        except OSError:
            pass

    # -------------------------------------------------------------------------
    # Downloading
    # -------------------------------------------------------------------------

    @property
    def download_url(self) -> str:
        """
        Returns a URL to download this file.
        """
        if not self.req:
            return ""
        querydict = {
            ViewParam.FILENAME: self.filename
        }
        return self.req.route_url(Routes.DOWNLOAD_FILE, _query=querydict)

    @property
    def contents(self) -> Optional[bytes]:
        """
        The file contents. May raise :exc:`OSError` if the read fails.
        """
        if not self.exists:
            return None
        with open(self.fullpath, "rb") as f:
            return f.read()

    # -------------------------------------------------------------------------
    # Bulk creation
    # -------------------------------------------------------------------------

    @classmethod
    def from_directory_scan(
            cls, directory: str,
            permitted_lifespan_min: float = 0,
            req: "CamcopsRequest" = None) -> List["UserDownloadFile"]:
        """
        Scans the directory and returns a list of :class:`UserDownloadFile`
        objects, one for each file in the directory.

        For each object, ``directory`` is the root directory (our parameter
        here), and ``filename`` is the filename RELATIVE to that.

        Args:
            directory: directory to scan
            permitted_lifespan_min: lifespan for each file
            req: a :class:`camcops_server.cc_modules.cc_request.CamcopsRequest`
        """
        results = []  # type: List[UserDownloadFile]
        # Imagine directory == "/etc":
        for root, dirs, files in os.walk(directory):
            # ... then root might at times be "/etc/apache2"
            for f in files:
                fullpath = os.path.join(root, f)
                relative_filename = relative_filename_within_dir(
                    fullpath, directory)
                results.append(UserDownloadFile(
                    filename=relative_filename,
                    directory=directory,
                    permitted_lifespan_min=permitted_lifespan_min,
                    req=req
                ))
        return results