#!/usr/bin/env python

# noinspection HttpUrlsUsage
"""
camcops_server/cc_modules/cc_export.py

===============================================================================

    Copyright (C) 2012, University of Cambridge, Department of Psychiatry.
    Created by Rudolf Cardinal (rnc1001@cam.ac.uk).

    This file is part of CamCOPS.

    CamCOPS is free software: you can redistribute it and/or modify
    it under the terms of the GNU General Public License as published by
    the Free Software Foundation, either version 3 of the License, or
    (at your option) any later version.

    CamCOPS is distributed in the hope that it will be useful,
    but WITHOUT ANY WARRANTY; without even the implied warranty of
    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
    GNU General Public License for more details.

    You should have received a copy of the GNU General Public License
    along with CamCOPS. If not, see <https://www.gnu.org/licenses/>.

===============================================================================

.. _ActiveMQ: https://activemq.apache.org/
.. _AMQP: https://www.amqp.org/
.. _APScheduler: https://apscheduler.readthedocs.io/
.. _Celery: https://www.celeryproject.org/
.. _Dramatiq: https://dramatiq.io/
.. _RabbitMQ: https://www.rabbitmq.com/
.. _Redis: https://redis.io/
.. _ZeroMQ: https://zeromq.org/

**Export and research dump functions.**

Export design:

*WHICH RECORDS TO SEND?*

The most powerful mechanism is not to have a sending queue (which would then
require careful multi-instance locking), but to have a "sent" log (sketched
below). That way:

- A record needs sending if it's not in the sent log (for an appropriate
  recipient).
- You can add a new recipient and the system will know about the (new)
  backlog automatically.
- You can specify criteria, e.g. don't upload records before 1/1/2014, then
  modify that later, and the system will catch up with the backlog.
- Successes and failures are logged in the same table.
- Multiple recipients are handled with ease.
- No need to alter database.pl code that receives from tablets.
- Can run with a simple cron job.
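For illustration only: with such a log, the "needs sending?" test reduces to
something like the sketch below (the real implementation is
:meth:`camcops_server.cc_modules.cc_exportmodels.ExportedTask.task_already_exported`,
used by :func:`export_task` below).

.. code-block:: python

    # Sketch, not the actual implementation:
    def needs_sending(dbsession, recipient_name, basetable, task_pk) -> bool:
        return not ExportedTask.task_already_exported(
            dbsession=dbsession,
            recipient_name=recipient_name,
            basetable=basetable,
            task_pk=task_pk,
        )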

*LOCKING*

- Don't use database locking:
  https://blog.engineyard.com/2011/5-subtle-ways-youre-using-mysql-as-a-queue-and-why-itll-bite-you
- Locking via UNIX lockfiles (example below):

  - https://pypi.python.org/pypi/lockfile
  - http://pythonhosted.org/lockfile/ (which also works on Windows)

  - On UNIX, ``lockfile`` uses ``LinkLockFile``:
    https://github.com/smontanaro/pylockfile/blob/master/lockfile/linklockfile.py
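For example, the non-blocking pattern used throughout this module looks like
this (a sketch; see :func:`export_whole_database` and :func:`export_task`
below for the real thing):

.. code-block:: python

    import lockfile

    try:
        with lockfile.FileLock(lockfilename, timeout=0):  # doesn't wait
            do_the_work()  # hypothetical work function
    except lockfile.AlreadyLocked:
        pass  # another process holds the lock and is doing the work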

*MESSAGE QUEUE AND BACKEND*

Thoughts as of 2018-12-22.

- See https://www.fullstackpython.com/task-queues.html. Also http://queues.io/;
  https://stackoverflow.com/questions/731233/activemq-or-rabbitmq-or-zeromq-or.

- The "default" is Celery_, with ``celery beat`` for scheduling, via an
  AMQP_ broker like RabbitMQ_.

  - Downside: no longer supported under Windows as of Celery 4.

  - There are immediate bugs when running the demo code with Celery 4.2.1,
    fixed by setting the environment variable ``set
    FORKED_BY_MULTIPROCESSING=1`` before running the worker; see
    https://github.com/celery/celery/issues/4178 and
    https://github.com/celery/celery/pull/4078.

  - Downside: backend is complex; e.g. Erlang dependency of RabbitMQ.

  - Celery also supports Redis_, but Redis_ doesn't support Windows directly
    (except the Windows Subsystem for Linux in Windows 10+).

- Another possibility is Dramatiq_ with APScheduler_.

  - Of note, APScheduler_ can use an SQLAlchemy database table as its job
    store, which might be good.
  - Dramatiq_ uses RabbitMQ_ or Redis_.
  - Dramatiq_ 1.4.0 (2018-11-25) installs cleanly under Windows. Use ``pip
    install --upgrade "dramatiq[rabbitmq, watch]"`` (i.e. with double quotes,
    not the single quotes it suggests, which don't work under Windows).
  - However, the basic example (https://dramatiq.io/guide.html) fails under
    Windows; when you fire up ``dramatiq count_words`` (even with ``--processes
    1 --threads 1``) it crashes with an error from ``ForkingPickler`` in
    ``multiprocessing.reduction``, i.e.
    https://docs.python.org/3/library/multiprocessing.html#windows. It also
    emits a ``PermissionError: [WinError 5] Access is denied``. This is
    discussed a bit at https://github.com/Bogdanp/dramatiq/issues/75;
    https://github.com/Bogdanp/dramatiq/blob/master/docs/source/changelog.rst.
    The changelog suggests 1.4.0 should work, but it doesn't.

- Worth some thought about ZeroMQ_, which is a very different sort of thing.
  Very cross-platform. Needs work to guard against message loss (i.e. messages
  are unreliable by default). Dynamic "special socket" style.

- Possibly also ActiveMQ_.

- OK; so speed is not critical, but we want message reliability, something
  that works under Windows, and decent Python bindings with job scheduling.

  - OUT: Redis (not Windows easily), ZeroMQ (fast but not by default reliable),
    ActiveMQ (few Python frameworks?).
  - REMAINING for message handling: RabbitMQ.
  - Python options therefore: Celery (but Windows not officially supported from
    4+); Dramatiq (but Windows also not very well supported and seems a bit
    bleeding-edge).

- This is looking like a mess from the Windows perspective.

- An alternative is just to use the database, of course.

  - https://softwareengineering.stackexchange.com/questions/351449/message-queue-database-vs-dedicated-mq
  - http://mikehadlow.blogspot.com/2012/04/database-as-queue-anti-pattern.html
  - https://blog.jooq.org/2014/09/26/using-your-rdbms-for-messaging-is-totally-ok/
  - https://stackoverflow.com/questions/13005410/why-do-we-need-message-brokers-like-rabbitmq-over-a-database-like-postgresql
  - https://www.quora.com/What-is-the-best-practice-using-db-tables-or-message-queues-for-moderation-of-content-approved-by-humans

- Let's take a step back and summarize the problem.

  - Many web threads may upload tasks. This should trigger a prompt export for
    all push recipients.
  - Whichever way we schedule a backend task job, it should be as the
    combination of recipient, basetable, task PK. (That way, if one recipient
    fails, the others can proceed independently.)
  - Every job should check that it's not been completed already (in case of
    accidental job restarts), i.e. is idempotent as far as we can make it.
  - How should this interact with the non-push recipients?
  - We should use the same locking method for push and non-push recipients.
  - We should make the locking granular and use file locks -- for example, for
    each task/recipient combination (or each whole-database export for a given
    recipient); a concrete sketch follows.
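A concrete sketch of that lock granularity (this mirrors :func:`export_task`
below; the filename comes from the config's
``get_export_lockfilename_recipient_task``):

.. code-block:: python

    # One lockfile per (recipient, task base table, task PK), so one
    # recipient/task combination cannot block any other:
    lockfilename = cfg.get_export_lockfilename_recipient_task(
        recipient_name=recipient.recipient_name,
        basetable=task.tablename,
        pk=task.pk,
    )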
152""" # noqa
154from contextlib import ExitStack
155import json
156import logging
157import os
158import sqlite3
159import tempfile
160from typing import (
161 Dict,
162 List,
163 Generator,
164 Optional,
165 Set,
166 Tuple,
167 Type,
168 TYPE_CHECKING,
169 Union,
170)
172from cardinal_pythonlib.classes import gen_all_subclasses
173from cardinal_pythonlib.datetimefunc import (
174 format_datetime,
175 get_now_localtz_pendulum,
176 get_tz_local,
177 get_tz_utc,
178)
179from cardinal_pythonlib.email.sendmail import CONTENT_TYPE_TEXT
180from cardinal_pythonlib.fileops import relative_filename_within_dir
181from cardinal_pythonlib.json.serialize import register_for_json
182from cardinal_pythonlib.logs import BraceStyleAdapter
183from cardinal_pythonlib.pyramid.responses import (
184 OdsResponse,
185 SqliteBinaryResponse,
186 TextAttachmentResponse,
187 XlsxResponse,
188 ZipResponse,
189)
190from cardinal_pythonlib.sizeformatter import bytes2human
191from cardinal_pythonlib.sqlalchemy.session import get_safe_url_from_engine
192import lockfile
193from pendulum import DateTime as Pendulum, Duration, Period
194from pyramid.httpexceptions import HTTPBadRequest
195from pyramid.renderers import render_to_response
196from pyramid.response import Response
197from sqlalchemy.engine import create_engine
198from sqlalchemy.engine.result import ResultProxy
199from sqlalchemy.orm import Session as SqlASession, sessionmaker
200from sqlalchemy.sql.expression import text
201from sqlalchemy.sql.schema import Column, MetaData, Table
202from sqlalchemy.sql.sqltypes import Text
204from camcops_server.cc_modules.cc_audit import audit
205from camcops_server.cc_modules.cc_constants import DateFormat, JSON_INDENT
206from camcops_server.cc_modules.cc_dataclasses import SummarySchemaInfo
207from camcops_server.cc_modules.cc_db import (
208 REMOVE_COLUMNS_FOR_SIMPLIFIED_SPREADSHEETS,
209)
210from camcops_server.cc_modules.cc_dump import copy_tasks_and_summaries
211from camcops_server.cc_modules.cc_email import Email
212from camcops_server.cc_modules.cc_exception import FhirExportException
213from camcops_server.cc_modules.cc_exportmodels import (
214 ExportedTask,
215 ExportRecipient,
216 gen_tasks_having_exportedtasks,
217 get_collection_for_export,
218)
219from camcops_server.cc_modules.cc_forms import UserDownloadDeleteForm
220from camcops_server.cc_modules.cc_pyramid import Routes, ViewArg, ViewParam
221from camcops_server.cc_modules.cc_simpleobjects import TaskExportOptions
222from camcops_server.cc_modules.cc_sqlalchemy import sql_from_sqlite_database
223from camcops_server.cc_modules.cc_task import SNOMED_TABLENAME, Task
224from camcops_server.cc_modules.cc_spreadsheet import (
225 SpreadsheetCollection,
226 SpreadsheetPage,
227)
228from camcops_server.cc_modules.celery import (
229 create_user_download,
230 email_basic_dump,
231 export_task_backend,
232 jittered_delay_s,
233)
235if TYPE_CHECKING:
236 from camcops_server.cc_modules.cc_request import CamcopsRequest
237 from camcops_server.cc_modules.cc_taskcollection import TaskCollection
239log = BraceStyleAdapter(logging.getLogger(__name__))


# =============================================================================
# Constants
# =============================================================================

INFOSCHEMA_PAGENAME = "_camcops_information_schema_columns"
SUMMARYSCHEMA_PAGENAME = "_camcops_column_explanations"
REMOVE_TABLES_FOR_SIMPLIFIED_SPREADSHEETS = {SNOMED_TABLENAME}
EMPTY_SET = set()


# =============================================================================
# Export tasks from the back end
# =============================================================================


def print_export_queue(
    req: "CamcopsRequest",
    recipient_names: List[str] = None,
    all_recipients: bool = False,
    via_index: bool = True,
    pretty: bool = False,
    debug_show_fhir: bool = False,
    debug_fhir_include_docs: bool = False,
) -> None:
    """
    Shows tasks that would be exported.

    - Called from the command line.

    Args:
        req:
            a :class:`camcops_server.cc_modules.cc_request.CamcopsRequest`
        recipient_names:
            list of export recipient names (as per the config file)
        all_recipients:
            use all recipients?
        via_index:
            use the task index (faster)?
        pretty:
            use ``str(task)`` not ``repr(task)`` (prettier, but slower because
            it has to query the patient)
        debug_show_fhir:
            Show FHIR output for each task, as JSON?
        debug_fhir_include_docs:
            (If debug_show_fhir.) Include document content? Large!
    """
    recipients = req.get_export_recipients(
        recipient_names=recipient_names,
        all_recipients=all_recipients,
        save=False,
    )
    if not recipients:
        log.warning("No export recipients")
        return
    for recipient in recipients:
        log.info("Tasks to be exported for recipient: {}", recipient)
        collection = get_collection_for_export(
            req, recipient, via_index=via_index
        )
        for task in collection.gen_tasks_by_class():
            print(
                f"{recipient.recipient_name}: "
                f"{str(task) if pretty else repr(task)}"
            )
            if debug_show_fhir:
                try:
                    bundle = task.get_fhir_bundle(
                        req,
                        recipient,
                        skip_docs_if_other_content=not debug_fhir_include_docs,
                    )
                    bundle_str = json.dumps(
                        bundle.as_json(), indent=JSON_INDENT
                    )
                    log.info("FHIR output as JSON:\n{}", bundle_str)
                except FhirExportException as e:
                    log.info("Task has no non-document content:\n{}", e)


def export(
    req: "CamcopsRequest",
    recipient_names: List[str] = None,
    all_recipients: bool = False,
    via_index: bool = True,
    schedule_via_backend: bool = False,
) -> None:
    """
    Exports all relevant tasks (pending incremental exports, or everything if
    applicable) for specified export recipients.

    - Called from the command line, or from
      :func:`camcops_server.cc_modules.celery.export_to_recipient_backend`.
    - Calls :func:`export_whole_database` or :func:`export_tasks_individually`.

    Args:
        req: a :class:`camcops_server.cc_modules.cc_request.CamcopsRequest`
        recipient_names: list of export recipient names (as per the config
            file)
        all_recipients: use all recipients?
        via_index: use the task index (faster)?
        schedule_via_backend: schedule jobs via the backend instead?
    """
    recipients = req.get_export_recipients(
        recipient_names=recipient_names, all_recipients=all_recipients
    )
    if not recipients:
        log.warning("No export recipients")
        return

    for recipient in recipients:
        log.info("Exporting to recipient: {}", recipient.recipient_name)
        if recipient.using_db():
            if schedule_via_backend:
                raise NotImplementedError(
                    "Not yet implemented: whole-database export via Celery "
                    "backend"
                )  # todo: implement whole-database export via Celery backend  # noqa
            else:
                export_whole_database(req, recipient, via_index=via_index)
        else:
            # Non-database recipient.
            export_tasks_individually(
                req,
                recipient,
                via_index=via_index,
                schedule_via_backend=schedule_via_backend,
            )
        log.info("Finished exporting to {}", recipient.recipient_name)


def export_whole_database(
    req: "CamcopsRequest", recipient: ExportRecipient, via_index: bool = True
) -> None:
    """
    Exports to a database.

    - Called by :func:`export`.
    - Holds a recipient-specific "database" file lock in the process.

    Args:
        req:
            a :class:`camcops_server.cc_modules.cc_request.CamcopsRequest`
        recipient:
            an
            :class:`camcops_server.cc_modules.cc_exportmodels.ExportRecipient`
        via_index:
            use the task index (faster)?
    """
    cfg = req.config
    lockfilename = cfg.get_export_lockfilename_recipient_db(
        recipient_name=recipient.recipient_name
    )
    try:
        with lockfile.FileLock(lockfilename, timeout=0):  # doesn't wait
            collection = get_collection_for_export(
                req, recipient, via_index=via_index
            )
            dst_engine = create_engine(
                recipient.db_url, echo=recipient.db_echo
            )
            log.info(
                "Exporting to database: {}",
                get_safe_url_from_engine(dst_engine),
            )
            dst_session = sessionmaker(bind=dst_engine)()  # type: SqlASession
            task_generator = gen_tasks_having_exportedtasks(collection)
            export_options = TaskExportOptions(
                include_blobs=recipient.db_include_blobs,
                db_patient_id_per_row=recipient.db_patient_id_per_row,
                db_make_all_tables_even_empty=True,
                db_include_summaries=recipient.db_add_summaries,
            )
            copy_tasks_and_summaries(
                tasks=task_generator,
                dst_engine=dst_engine,
                dst_session=dst_session,
                export_options=export_options,
                req=req,
            )
            dst_session.commit()
    except lockfile.AlreadyLocked:
        log.warning(
            "Export lockfile {!r} already locked by another process; "
            "aborting (another process is doing this work)",
            lockfilename,
        )
        # No need to retry by raising -- if someone else holds this lock, they
        # are doing the work that we wanted to do.


def export_tasks_individually(
    req: "CamcopsRequest",
    recipient: ExportRecipient,
    via_index: bool = True,
    schedule_via_backend: bool = False,
) -> None:
    """
    Exports all necessary tasks for a recipient.

    - Called by :func:`export`.
    - Calls :func:`export_task`, if ``schedule_via_backend`` is False.
    - Schedules :func:`camcops_server.cc_modules.celery.export_task_backend`,
      if ``schedule_via_backend`` is True, which calls :func:`export_task` in
      turn.

    Args:
        req:
            a :class:`camcops_server.cc_modules.cc_request.CamcopsRequest`
        recipient:
            an
            :class:`camcops_server.cc_modules.cc_exportmodels.ExportRecipient`
        via_index:
            use the task index (faster)?
        schedule_via_backend:
            schedule jobs via the backend instead?
    """
    collection = get_collection_for_export(req, recipient, via_index=via_index)
    n_tasks = 0
    recipient_name = recipient.recipient_name
    if schedule_via_backend:
        for task_or_index in collection.gen_all_tasks_or_indexes():
            if isinstance(task_or_index, Task):
                basetable = task_or_index.tablename
                task_pk = task_or_index.pk
            else:
                basetable = task_or_index.task_table_name
                task_pk = task_or_index.task_pk
            log.info(
                "Scheduling job to export task {}.{} to {}",
                basetable,
                task_pk,
                recipient_name,
            )
            export_task_backend.delay(
                recipient_name=recipient_name,
                basetable=basetable,
                task_pk=task_pk,
            )
            n_tasks += 1
        log.info(
            f"Scheduled {n_tasks} background task exports to "
            f"{recipient_name}"
        )
    else:
        for task in collection.gen_tasks_by_class():
            # Do NOT use this to check the working of export_task_backend():
            # export_task_backend(recipient.recipient_name, task.tablename, task.pk)  # noqa
            # ... it will deadlock at the database (because we're already
            # within a query of some sort, I presume)
            export_task(req, recipient, task)
            n_tasks += 1
        log.info(f"Exported {n_tasks} tasks to {recipient_name}")


def export_task(
    req: "CamcopsRequest", recipient: ExportRecipient, task: Task
) -> None:
    """
    Exports a single task, checking that it remains valid to do so.

    - Called by :func:`export_tasks_individually` directly, or called via
      :func:`camcops_server.cc_modules.celery.export_task_backend` if
      :func:`export_tasks_individually` requested that.
    - Calls
      :meth:`camcops_server.cc_modules.cc_exportmodels.ExportedTask.export`.
    - For FHIR, holds a recipient-specific "FHIR" file lock during export.
    - Always holds a recipient-and-task-specific file lock during export.

    Args:
        req:
            a :class:`camcops_server.cc_modules.cc_request.CamcopsRequest`
        recipient:
            an
            :class:`camcops_server.cc_modules.cc_exportmodels.ExportRecipient`
        task:
            a :class:`camcops_server.cc_modules.cc_task.Task`
    """

    # Double-check it's OK! Just in case, for example, an old backend task has
    # persisted, or someone's managed to get an iffy back-end request in some
    # other way.
    if not recipient.is_task_suitable(task):
        # Warning will already have been emitted (by is_task_suitable).
        return

    cfg = req.config
    lockfilename = cfg.get_export_lockfilename_recipient_task(
        recipient_name=recipient.recipient_name,
        basetable=task.tablename,
        pk=task.pk,
    )
    dbsession = req.dbsession
    with ExitStack() as stack:

        if recipient.using_fhir() and not recipient.fhir_concurrent:
            # Some FHIR servers struggle with parallel processing, so we hold
            # a lock to serialize them. See notes in cc_fhir.py.
            #
            # We always use the order (1) FHIR lockfile, (2) task lockfile, to
            # avoid a deadlock.
            #
            # (Note that it is impossible that a non-FHIR task export grabs
            # the second of these without the first, because the second
            # lockfile is recipient-specific and the recipient details include
            # the fact that it is a FHIR recipient.)
            fhir_lockfilename = cfg.get_export_lockfilename_recipient_fhir(
                recipient_name=recipient.recipient_name
            )
            try:
                stack.enter_context(
                    lockfile.FileLock(
                        fhir_lockfilename, timeout=jittered_delay_s()
                    )
                    # waits for a while
                )
            except lockfile.AlreadyLocked:
                log.warning(
                    "Export lockfile {!r} already locked by another process; "
                    "will try again later",
                    fhir_lockfilename,
                )
                raise
                # We will reschedule via Celery; see "self.retry(...)" in
                # celery.py

        try:
            stack.enter_context(
                lockfile.FileLock(lockfilename, timeout=0)  # doesn't wait
            )
            # We recheck the export status once we hold the lock, in case
            # multiple jobs are competing to export it.
            if ExportedTask.task_already_exported(
                dbsession=dbsession,
                recipient_name=recipient.recipient_name,
                basetable=task.tablename,
                task_pk=task.pk,
            ):
                log.info(
                    "Task {!r} already exported to recipient {}; ignoring",
                    task,
                    recipient,
                )
                # Not a warning; it's normal to see these because it allows
                # the client API to skip some checks for speed.
                return
            # OK; safe to export now.
            et = ExportedTask(recipient, task)
            dbsession.add(et)
            et.export(req)
            dbsession.commit()  # so the ExportedTask is visible to others ASAP
        except lockfile.AlreadyLocked:
            log.warning(
                "Export lockfile {!r} already locked by another process; "
                "aborting (another process is doing this work)",
                lockfilename,
            )


# =============================================================================
# Helpers for task collection export functions
# =============================================================================


def gen_audited_tasks_for_task_class(
    collection: "TaskCollection",
    cls: Type[Task],
    audit_descriptions: List[str],
) -> Generator[Task, None, None]:
    """
    Generates tasks from a collection, for a given task class, simultaneously
    adding to an audit description. Used for user-triggered downloads.

    Args:
        collection:
            a
            :class:`camcops_server.cc_modules.cc_taskcollection.TaskCollection`
        cls:
            the task class to generate
        audit_descriptions:
            list of strings to be modified

    Yields:
        :class:`camcops_server.cc_modules.cc_task.Task` objects
    """
    pklist = []  # type: List[int]
    for task in collection.tasks_for_task_class(cls):
        pklist.append(task.pk)
        yield task
    audit_descriptions.append(
        f"{cls.__tablename__}: {','.join(str(pk) for pk in pklist)}"
    )


def gen_audited_tasks_by_task_class(
    collection: "TaskCollection", audit_descriptions: List[str]
) -> Generator[Task, None, None]:
    """
    Generates tasks from a collection, across task classes, simultaneously
    adding to an audit description. Used for user-triggered downloads.

    Args:
        collection: a :class:`camcops_server.cc_modules.cc_taskcollection.TaskCollection`
        audit_descriptions: list of strings to be modified

    Yields:
        :class:`camcops_server.cc_modules.cc_task.Task` objects
    """  # noqa
    for cls in collection.task_classes():
        for task in gen_audited_tasks_for_task_class(
            collection, cls, audit_descriptions
        ):
            yield task


def get_information_schema_query(req: "CamcopsRequest") -> ResultProxy:
    """
    Returns an SQLAlchemy ``ResultProxy`` fetching the
    INFORMATION_SCHEMA.COLUMNS information from our source database.

    This is not sensitive; there is no data, just structure/comments.
    """
    # Find our database name
    # https://stackoverflow.com/questions/53554458/sqlalchemy-get-database-name-from-engine  # noqa
    dbname = req.engine.url.database
    # Query the information schema for our database.
    # https://docs.sqlalchemy.org/en/13/core/sqlelement.html#sqlalchemy.sql.expression.text  # noqa
    query = text(
        """
        SELECT *
        FROM information_schema.columns
        WHERE table_schema = :dbname
        """
    ).bindparams(dbname=dbname)
    result_proxy = req.dbsession.execute(query)
    return result_proxy


def get_information_schema_spreadsheet_page(
    req: "CamcopsRequest", page_name: str = INFOSCHEMA_PAGENAME
) -> SpreadsheetPage:
    """
    Returns the server database's ``INFORMATION_SCHEMA.COLUMNS`` table as a
    :class:`camcops_server.cc_modules.cc_spreadsheet.SpreadsheetPage`.
    """
    result_proxy = get_information_schema_query(req)
    return SpreadsheetPage.from_resultproxy(page_name, result_proxy)


def write_information_schema_to_dst(
    req: "CamcopsRequest",
    dst_session: SqlASession,
    dest_table_name: str = INFOSCHEMA_PAGENAME,
) -> None:
    """
    Writes the server's information schema to a separate database session
    (which will be an SQLite database being created for download).

    There must be no open transactions (i.e. please COMMIT before you call
    this function), since we need to create a table.
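
    Usage sketch (this mirrors the call in
    :meth:`SqliteExporter.get_sqlite_data` below):

    .. code-block:: python

        dst_session.commit()  # ensure no open transaction first
        write_information_schema_to_dst(req, dst_session)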
700 """
701 # 1. Read the structure of INFORMATION_SCHEMA.COLUMNS itself.
702 # https://stackoverflow.com/questions/21770829/sqlalchemy-copy-schema-and-data-of-subquery-to-another-database # noqa
703 src_engine = req.engine
704 dst_engine = dst_session.bind
705 metadata = MetaData(bind=dst_engine)
706 table = Table(
707 "columns", # table name; see also "schema" argument
708 metadata, # "load with the destination metadata"
709 # Override some specific column types by hand, or they'll fail as
710 # SQLAlchemy fails to reflect the MySQL LONGTEXT type properly:
711 Column("COLUMN_DEFAULT", Text),
712 Column("COLUMN_TYPE", Text),
713 Column("GENERATION_EXPRESSION", Text),
714 autoload=True, # "read (reflect) structure from the database"
715 autoload_with=src_engine, # "read (reflect) structure from the source"
716 schema="information_schema", # schema
717 )
718 # 2. Write that structure to our new database.
719 table.name = dest_table_name # create it with a different name
720 table.schema = "" # we don't have a schema in the destination database
721 table.create(dst_engine) # CREATE TABLE
722 # 3. Fetch data.
723 query = get_information_schema_query(req)
724 # 4. Write the data.
725 for row in query:
726 dst_session.execute(table.insert(row))
727 # 5. COMMIT
728 dst_session.commit()


# =============================================================================
# Convert task collections to different export formats for user download
# =============================================================================


@register_for_json
class DownloadOptions(object):
    """
    Represents options for the process of the user downloading tasks.
    """

    DELIVERY_MODES = [ViewArg.DOWNLOAD, ViewArg.EMAIL, ViewArg.IMMEDIATELY]

    def __init__(
        self,
        user_id: int,
        viewtype: str,
        delivery_mode: str,
        spreadsheet_simplified: bool = False,
        spreadsheet_sort_by_heading: bool = False,
        db_include_blobs: bool = False,
        db_patient_id_per_row: bool = False,
        include_information_schema_columns: bool = True,
        include_summary_schema: bool = True,
    ) -> None:
        """
        Args:
            user_id:
                ID of the user creating the request (may be needed to pass to
                the back-end)
            viewtype:
                file format for receiving data (e.g. XLSX, SQLite)
            delivery_mode:
                method of delivery (e.g. immediate, e-mail)
            spreadsheet_simplified:
                (For spreadsheets.)
                Simplify the output by removing some tables (e.g. SNOMED
                data) and system columns?
            spreadsheet_sort_by_heading:
                (For spreadsheets.)
                Sort columns within each page by heading name?
            db_include_blobs:
                (For database downloads.)
                Include BLOBs?
            db_patient_id_per_row:
                (For database downloads.)
                Denormalize by including the patient ID in all rows of
                patient-related tables?
            include_information_schema_columns:
                Include descriptions of the database source columns?
            include_summary_schema:
                Include descriptions of summary columns and other columns in
                output spreadsheets?
        """
        assert delivery_mode in self.DELIVERY_MODES
        self.user_id = user_id
        self.viewtype = viewtype
        self.delivery_mode = delivery_mode
        self.spreadsheet_simplified = spreadsheet_simplified
        self.spreadsheet_sort_by_heading = spreadsheet_sort_by_heading
        self.db_include_blobs = db_include_blobs
        self.db_patient_id_per_row = db_patient_id_per_row
        self.include_information_schema_columns = (
            include_information_schema_columns
        )
        self.include_summary_schema = include_summary_schema


class TaskCollectionExporter(object):
    """
    Class to provide tasks for user download.
    """

    def __init__(
        self,
        req: "CamcopsRequest",
        collection: "TaskCollection",
        options: DownloadOptions,
    ):
        """
        Args:
            req:
                a :class:`camcops_server.cc_modules.cc_request.CamcopsRequest`
            collection:
                a :class:`camcops_server.cc_modules.cc_taskcollection.TaskCollection`
            options:
                :class:`DownloadOptions` governing the download
        """  # noqa
        self.req = req
        self.collection = collection
        self.options = options

    @property
    def viewtype(self) -> str:
        raise NotImplementedError("Exporter needs to implement 'viewtype'")

    @property
    def file_extension(self) -> str:
        raise NotImplementedError(
            "Exporter needs to implement 'file_extension'"
        )

    def get_filename(self) -> str:
        """
        Returns the filename for the download.
        """
        timestamp = format_datetime(self.req.now, DateFormat.FILENAME)
        return f"CamCOPS_dump_{timestamp}.{self.file_extension}"

    def immediate_response(self, req: "CamcopsRequest") -> Response:
        """
        Returns either a :class:`Response` with the data, or a
        :class:`Response` saying how the user will obtain their data later.

        Args:
            req: a :class:`camcops_server.cc_modules.cc_request.CamcopsRequest`
        """
        if self.options.delivery_mode == ViewArg.EMAIL:
            self.schedule_email()
            return render_to_response(
                "email_scheduled.mako", dict(), request=req
            )
        elif self.options.delivery_mode == ViewArg.DOWNLOAD:
            self.schedule_download()
            return render_to_response(
                "download_scheduled.mako", dict(), request=req
            )
        else:  # ViewArg.IMMEDIATELY
            return self.download_now()

    def download_now(self) -> Response:
        """
        Download the data dump in the selected format.
        """
        filename, body = self.to_file()
        return self.get_data_response(body=body, filename=filename)

    def schedule_email(self) -> None:
        """
        Schedule the export asynchronously, and e-mail the logged-in user
        when done.
        """
        email_basic_dump.delay(self.collection, self.options)

    def send_by_email(self) -> None:
        """
        Send the data dump by e-mail to the logged-in user.
        """
        _ = self.req.gettext
        config = self.req.config

        filename, body = self.to_file()
        email_to = self.req.user.email
        email = Email(
            # date: automatic
            from_addr=config.email_from,
            to=email_to,
            subject=_("CamCOPS research data dump"),
            body=_("The research data dump you requested is attached."),
            content_type=CONTENT_TYPE_TEXT,
            charset="utf8",
            attachments_binary=[(filename, body)],
        )
        email.send(
            host=config.email_host,
            username=config.email_host_username,
            password=config.email_host_password,
            port=config.email_port,
            use_tls=config.email_use_tls,
        )

        if email.sent:
            log.info(f"Research dump emailed to {email_to}")
        else:
            log.error(f"Failed to email research dump to {email_to}")

    def schedule_download(self) -> None:
        """
        Schedule a background export to a file that the user can download
        later.
        """
        create_user_download.delay(self.collection, self.options)

    def create_user_download_and_email(self) -> None:
        """
        Creates a user download, and e-mails the user to let them know.
        """
        _ = self.req.gettext
        config = self.req.config

        download_dir = self.req.user_download_dir
        space = self.req.user_download_bytes_available
        filename, contents = self.to_file()
        size = len(contents)

        if size > space:
            # Not enough space
            total_permitted = self.req.user_download_bytes_permitted
            msg = _(
                "You do not have enough space to create this download. "
                "You are allowed {total_permitted} bytes and you have "
                "{space} bytes free. This download would need {size} bytes."
            ).format(total_permitted=total_permitted, space=space, size=size)
        else:
            # Create file
            fullpath = os.path.join(download_dir, filename)
            try:
                with open(fullpath, "wb") as f:
                    f.write(contents)
                # Success
                log.info(f"Created user download: {fullpath}")
                msg = (
                    _(
                        "The research data dump you requested is ready to be "
                        "downloaded. You will find it in your download area. "
                        "It is called %s"
                    )
                    % filename
                )
            except Exception as e:
                # Some other error
                msg = _(
                    "Failed to create file {filename}. Error was: {message}"
                ).format(filename=filename, message=e)

        # E-mail the user, if they have an e-mail address
        email_to = self.req.user.email
        if email_to:
            email = Email(
                # date: automatic
                from_addr=config.email_from,
                to=email_to,
                subject=_("CamCOPS research data dump"),
                body=msg,
                content_type=CONTENT_TYPE_TEXT,
                charset="utf8",
            )
            email.send(
                host=config.email_host,
                username=config.email_host_username,
                password=config.email_host_password,
                port=config.email_port,
                use_tls=config.email_use_tls,
            )

    def get_data_response(self, body: bytes, filename: str) -> Response:
        raise NotImplementedError(
            "Exporter needs to implement 'get_data_response'"
        )

    def to_file(self) -> Tuple[str, bytes]:
        """
        Returns the tuple ``filename, file_contents``.
        """
        return self.get_filename(), self.get_file_body()

    def get_file_body(self) -> bytes:
        """
        Returns binary data to be stored as a file.
        """
        raise NotImplementedError(
            "Exporter needs to implement 'get_file_body'"
        )

    def get_spreadsheet_collection(self) -> SpreadsheetCollection:
        """
        Converts the collection of tasks to a collection of spreadsheet-style
        data. Also audits the request as a basic data dump.

        Returns:
            a
            :class:`camcops_server.cc_modules.cc_spreadsheet.SpreadsheetCollection`
            object
        """  # noqa
        audit_descriptions = []  # type: List[str]
        options = self.options
        if options.spreadsheet_simplified:
            summary_exclusion_tables = (
                REMOVE_TABLES_FOR_SIMPLIFIED_SPREADSHEETS
            )
            summary_exclusion_columns = (
                REMOVE_COLUMNS_FOR_SIMPLIFIED_SPREADSHEETS
            )
        else:
            summary_exclusion_tables = EMPTY_SET
            summary_exclusion_columns = EMPTY_SET
        # Task may return >1 sheet for output (e.g. for subtables).
        coll = SpreadsheetCollection()

        # Iterate through tasks, creating the spreadsheet collection
        schema_elements = set()  # type: Set[SummarySchemaInfo]
        for cls in self.collection.task_classes():
            schema_done = False
            for task in gen_audited_tasks_for_task_class(
                self.collection, cls, audit_descriptions
            ):
                # Task data
                coll.add_pages(task.get_spreadsheet_pages(self.req))
                if not schema_done and options.include_summary_schema:
                    # Schema (including summary explanations)
                    schema_elements |= task.get_spreadsheet_schema_elements(
                        self.req
                    )
                    # We just need this from one task instance.
                    schema_done = True

        if options.include_summary_schema:
            coll.add_page(
                SpreadsheetPage(
                    name=SUMMARYSCHEMA_PAGENAME,
                    rows=[
                        si.as_dict
                        for si in sorted(schema_elements)
                        if si.column_name not in summary_exclusion_columns
                        and si.table_name not in summary_exclusion_tables
                    ],
                )
            )

        if options.include_information_schema_columns:
            # Source database information schema
            coll.add_page(get_information_schema_spreadsheet_page(self.req))

        # Simplify
        if options.spreadsheet_simplified:
            coll.delete_pages(summary_exclusion_tables)
            coll.delete_columns(summary_exclusion_columns)

        # Sort
        coll.sort_pages()
        if options.spreadsheet_sort_by_heading:
            coll.sort_headings_within_all_pages()

        # Audit
        audit(self.req, f"Basic dump: {'; '.join(audit_descriptions)}")

        return coll


class OdsExporter(TaskCollectionExporter):
    """
    Converts a set of tasks to an OpenOffice ODS file.
    """

    file_extension = "ods"
    viewtype = ViewArg.ODS

    def get_file_body(self) -> bytes:
        return self.get_spreadsheet_collection().as_ods()

    def get_data_response(self, body: bytes, filename: str) -> Response:
        return OdsResponse(body=body, filename=filename)


class RExporter(TaskCollectionExporter):
    """
    Converts a set of tasks to an R script.
    """

    file_extension = "R"
    viewtype = ViewArg.R

    def __init__(self, *args, **kwargs) -> None:
        super().__init__(*args, **kwargs)
        self.encoding = "utf-8"

    def get_file_body(self) -> bytes:
        return self.get_r_script().encode(self.encoding)

    def get_r_script(self) -> str:
        return self.get_spreadsheet_collection().as_r()

    def get_data_response(self, body: bytes, filename: str) -> Response:
        filename = self.get_filename()
        r_script = self.get_r_script()
        return TextAttachmentResponse(body=r_script, filename=filename)


class TsvZipExporter(TaskCollectionExporter):
    """
    Converts a set of tasks to a set of TSV (tab-separated value) files, one
    per table, in a ZIP file.
    """

    file_extension = "zip"
    viewtype = ViewArg.TSV_ZIP

    def get_file_body(self) -> bytes:
        return self.get_spreadsheet_collection().as_zip()

    def get_data_response(self, body: bytes, filename: str) -> Response:
        return ZipResponse(body=body, filename=filename)


class XlsxExporter(TaskCollectionExporter):
    """
    Converts a set of tasks to an Excel XLSX file.
    """

    file_extension = "xlsx"
    viewtype = ViewArg.XLSX

    def get_file_body(self) -> bytes:
        return self.get_spreadsheet_collection().as_xlsx()

    def get_data_response(self, body: bytes, filename: str) -> Response:
        return XlsxResponse(body=body, filename=filename)


class SqliteExporter(TaskCollectionExporter):
    """
    Converts a set of tasks to an SQLite binary file.
    """

    file_extension = "sqlite"
    viewtype = ViewArg.SQLITE

    def get_export_options(self) -> TaskExportOptions:
        return TaskExportOptions(
            include_blobs=self.options.db_include_blobs,
            db_include_summaries=True,
            db_make_all_tables_even_empty=True,  # debatable, but more consistent!  # noqa
            db_patient_id_per_row=self.options.db_patient_id_per_row,
        )

    def get_sqlite_data(self, as_text: bool) -> Union[bytes, str]:
        """
        Returns data as a binary SQLite database, or SQL text to create it.

        Args:
            as_text: textual SQL, rather than binary SQLite?

        Returns:
            ``bytes`` or ``str``, according to ``as_text``
        """
        # ---------------------------------------------------------------------
        # Create memory file, dumper, and engine
        # ---------------------------------------------------------------------

        # This approach failed:
        #
        # memfile = io.StringIO()
        #
        # def dump(querysql, *multiparams, **params):
        #     compsql = querysql.compile(dialect=engine.dialect)
        #     memfile.write("{};\n".format(compsql))
        #
        # engine = create_engine('{dialect}://'.format(dialect=dialect_name),
        #                        strategy='mock', executor=dump)
        # dst_session = sessionmaker(bind=engine)()  # type: SqlASession
        #
        # ... you get the error
        #     AttributeError: 'MockConnection' object has no attribute 'begin'
        # ... which is fair enough.
        #
        # Next best thing: SQLite database.
        # Two ways to deal with it:
        # (a) duplicate our C++ dump code (which itself duplicates the SQLite
        #     command-line executable's dump facility), then create the
        #     database, dump it to a string, serve the string; or
        # (b) offer the binary SQLite file.
        # Or... (c) both.
        # Aha! sqlite3.Connection.iterdump() does this for us.
        #
        # If we create an in-memory database using create_engine('sqlite://'),
        # can we get the binary contents out? Don't think so.
        #
        # So we should first create a temporary on-disk file, then use that.

        # ---------------------------------------------------------------------
        # Make temporary file (one whose filename we can know).
        # ---------------------------------------------------------------------
        # We could use tempfile.mkstemp() for security, or NamedTemporaryFile,
        # which is a bit easier. However, you can't necessarily open the file
        # again under all OSs, so that's no good. The final option is
        # TemporaryDirectory, which is secure and convenient.
        #
        # https://docs.python.org/3/library/tempfile.html
        # https://security.openstack.org/guidelines/dg_using-temporary-files-securely.html  # noqa
        # https://stackoverflow.com/questions/3924117/how-to-use-tempfile-namedtemporaryfile-in-python  # noqa
        db_basename = "temp.sqlite3"
        with tempfile.TemporaryDirectory() as tmpdirname:
            db_filename = os.path.join(tmpdirname, db_basename)
            # -----------------------------------------------------------------
            # Make SQLAlchemy session
            # -----------------------------------------------------------------
            url = "sqlite:///" + db_filename
            engine = create_engine(url, echo=False)
            dst_session = sessionmaker(bind=engine)()  # type: SqlASession
            # -----------------------------------------------------------------
            # Iterate through tasks, creating tables as we need them.
            # -----------------------------------------------------------------
            audit_descriptions = []  # type: List[str]
            task_generator = gen_audited_tasks_by_task_class(
                self.collection, audit_descriptions
            )
            # -----------------------------------------------------------------
            # Next bit very tricky. We're trying to achieve several things:
            # - a copy of part of the database structure
            # - a copy of part of the data, with relationships intact
            # - nothing sensitive (e.g. full User records) going through
            # - adding new columns for Task objects offering summary values
            # - Must treat tasks all together, because otherwise we will
            #   insert duplicate dependency objects like Group objects.
            # -----------------------------------------------------------------
            copy_tasks_and_summaries(
                tasks=task_generator,
                dst_engine=engine,
                dst_session=dst_session,
                export_options=self.get_export_options(),
                req=self.req,
            )
            dst_session.commit()
            if self.options.include_information_schema_columns:
                # Must have committed before we do this:
                write_information_schema_to_dst(self.req, dst_session)
            # -----------------------------------------------------------------
            # Audit
            # -----------------------------------------------------------------
            audit(self.req, f"SQL dump: {'; '.join(audit_descriptions)}")
            # -----------------------------------------------------------------
            # Fetch file contents, either as binary, or as SQL
            # -----------------------------------------------------------------
            if as_text:
                # SQL text
                connection = sqlite3.connect(
                    db_filename
                )  # type: sqlite3.Connection
                sql_text = sql_from_sqlite_database(connection)
                connection.close()
                return sql_text
            else:
                # SQLite binary
                with open(db_filename, "rb") as f:
                    binary_contents = f.read()
                return binary_contents

    def get_file_body(self) -> bytes:
        return self.get_sqlite_data(as_text=False)

    def get_data_response(self, body: bytes, filename: str) -> Response:
        return SqliteBinaryResponse(body=body, filename=filename)


class SqlExporter(SqliteExporter):
    """
    Converts a set of tasks to the textual SQL needed to create an SQLite
    file.
    """

    file_extension = "sql"
    viewtype = ViewArg.SQL

    def __init__(self, *args, **kwargs) -> None:
        super().__init__(*args, **kwargs)
        self.encoding = "utf-8"

    def get_file_body(self) -> bytes:
        return self.get_sql().encode(self.encoding)

    def get_sql(self) -> str:
        """
        Returns SQL text representing the SQLite database.
        """
        return self.get_sqlite_data(as_text=True)

    def download_now(self) -> Response:
        """
        Download the data dump in the selected format.
        """
        filename = self.get_filename()
        sql_text = self.get_sql()
        return TextAttachmentResponse(body=sql_text, filename=filename)

    def get_data_response(self, body: bytes, filename: str) -> Response:
        """
        Unused.
        """
        pass


# Create mapping from "viewtype" to class.
# noinspection PyTypeChecker
DOWNLOADER_CLASSES = {}  # type: Dict[str, Type[TaskCollectionExporter]]
for _cls in gen_all_subclasses(
    TaskCollectionExporter
):  # type: Type[TaskCollectionExporter]  # noqa
    # noinspection PyTypeChecker
    DOWNLOADER_CLASSES[_cls.viewtype] = _cls


def make_exporter(
    req: "CamcopsRequest",
    collection: "TaskCollection",
    options: DownloadOptions,
) -> TaskCollectionExporter:
    """
    Returns an exporter suitable for the download options requested.

    Args:
        req:
            a :class:`camcops_server.cc_modules.cc_request.CamcopsRequest`
        collection:
            a
            :class:`camcops_server.cc_modules.cc_taskcollection.TaskCollection`
        options:
            :class:`camcops_server.cc_modules.cc_export.DownloadOptions`
            governing the download

    Returns:
        a :class:`TaskCollectionExporter`

    Raises:
        :exc:`HTTPBadRequest` if the arguments are bad
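
    Example (a sketch only; it assumes a ``CamcopsRequest`` providing
    ``req.user_id``, and a ``TaskCollection`` in ``collection``):

    .. code-block:: python

        options = DownloadOptions(
            user_id=req.user_id,  # assumption: user_id available on request
            viewtype=ViewArg.XLSX,
            delivery_mode=ViewArg.IMMEDIATELY,
        )
        exporter = make_exporter(req=req, collection=collection,
                                 options=options)
        response = exporter.immediate_response(req)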
1337 """
1338 _ = req.gettext
1339 if options.delivery_mode not in DownloadOptions.DELIVERY_MODES:
1340 raise HTTPBadRequest(
1341 f"{_('Bad delivery mode:')} {options.delivery_mode!r} "
1342 f"({_('permissible:')} "
1343 f"{DownloadOptions.DELIVERY_MODES!r})"
1344 )
1345 try:
1346 downloader_class = DOWNLOADER_CLASSES[options.viewtype]
1347 except KeyError:
1348 raise HTTPBadRequest(
1349 f"{_('Bad output type:')} {options.viewtype!r} "
1350 f"({_('permissible:')} {DOWNLOADER_CLASSES.keys()!r})"
1351 )
1352 return downloader_class(req=req, collection=collection, options=options)


# =============================================================================
# Represent files for users to download
# =============================================================================


class UserDownloadFile(object):
    """
    Represents a file that has been generated for the user to download.

    Test code:

    .. code-block:: python

        from camcops_server.cc_modules.cc_export import UserDownloadFile
        x = UserDownloadFile("/etc/hosts")

        print(x.when_last_modified)  # should match output of: ls -l /etc/hosts

        many = UserDownloadFile.from_directory_scan("/etc")
    """

    def __init__(
        self,
        filename: str,
        directory: str = "",
        permitted_lifespan_min: float = 0,
        req: "CamcopsRequest" = None,
    ) -> None:
        """
        Args:
            filename:
                Filename, either absolute, or if ``directory`` is specified,
                relative to ``directory``.
            directory:
                Directory. If specified, ``filename`` must be within it.
            permitted_lifespan_min:
                Time, in minutes, for which the file may live before the
                server deletes it.
            req:
                a :class:`camcops_server.cc_modules.cc_request.CamcopsRequest`

        Notes:

        - The Unix ``ls`` command shows timestamps in the current timezone.
          Try ``TZ=utc ls -l <filename>`` or ``TZ="America/New_York" ls -l
          <filename>`` to see this.
        - The underlying timestamp is the time (in seconds) since the Unix
          "epoch", which is 00:00:00 UTC on 1 Jan 1970
          (https://en.wikipedia.org/wiki/Unix_time).
        """
        self.filename = filename
        self.permitted_lifespan_min = permitted_lifespan_min
        self.req = req

        self.basename = os.path.basename(filename)
        _, self.extension = os.path.splitext(filename)
        if directory:
            # filename must be within the directory specified
            self.directory = os.path.abspath(directory)
            candidate_path = os.path.abspath(
                os.path.join(self.directory, filename)
            )
            # Compare against the absolute directory path:
            if (
                os.path.commonpath([self.directory, candidate_path])
                != self.directory
            ):
                # Filename is not within directory.
                # This is dodgy -- someone may have passed a filename like
                # "../../dangerous_dir/unsafe_content.txt"
                self.fullpath = ""
                # ... ensures that "exists" will be False.
            else:
                self.fullpath = candidate_path
        else:
            # filename is treated as an absolute path
            self.directory = ""
            self.fullpath = filename

        try:
            self.statinfo = os.stat(self.fullpath)
            self.exists = True
        except FileNotFoundError:
            self.statinfo = None  # type: Optional[os.stat_result]
            self.exists = False

    # -------------------------------------------------------------------------
    # Size
    # -------------------------------------------------------------------------

    @property
    def size(self) -> Optional[int]:
        """
        Size of the file, in bytes. Returns ``None`` if the file does not
        exist.
        """
        return self.statinfo.st_size if self.exists else None

    @property
    def size_str(self) -> str:
        """
        Returns a pretty-format string describing the file's size.
        """
        size_bytes = self.size
        if size_bytes is None:
            return ""
        return bytes2human(size_bytes)

    # -------------------------------------------------------------------------
    # Timing
    # -------------------------------------------------------------------------

    @property
    def when_last_modified(self) -> Optional[Pendulum]:
        """
        Returns the file's modification time, or ``None`` if it doesn't exist.

        (Creation time is harder! See
        https://stackoverflow.com/questions/237079/how-to-get-file-creation-modification-date-times-in-python.)
        """  # noqa
        if not self.exists:
            return None
        # noinspection PyTypeChecker
        creation = Pendulum.fromtimestamp(
            self.statinfo.st_mtime, tz=get_tz_utc()
        )  # type: Pendulum
        # ... gives the correct time in the UTC timezone
        # ... note that utcfromtimestamp() gives a time without a timezone,
        #     which is unhelpful!
        # We would like this to display in the current timezone:
        return creation.in_timezone(get_tz_local())

    @property
    def when_last_modified_str(self) -> str:
        """
        Returns a formatted string with the file's modification time.
        """
        w = self.when_last_modified
        if not w:
            return ""
        return format_datetime(w, DateFormat.ISO8601_HUMANIZED_TO_SECONDS)

    @property
    def time_left(self) -> Optional[Duration]:
        """
        Returns the amount of time that this file has left to live before
        the server will delete it. Returns ``None`` if the file does not
        exist.
        """
        if not self.exists:
            return None
        now = get_now_localtz_pendulum()
        death = self.when_last_modified + Duration(
            minutes=self.permitted_lifespan_min
        )
        remaining = death - now  # type: Period
        # Note that Period is a subclass of Duration, but its __str__()
        # method is different. Duration maps __str__() to in_words(), but
        # Period maps __str__() to __repr__().
        return remaining

    @property
    def time_left_str(self) -> str:
        """
        A string version of :meth:`time_left`.
        """
        t = self.time_left
        if not t:
            return ""
        return t.in_words()  # Duration and Period do nice formatting

    def older_than(self, when: Pendulum) -> bool:
        """
        Was the file last modified before the specified time?
        """
        m = self.when_last_modified
        if not m:
            return False
        return m < when

    # -------------------------------------------------------------------------
    # Deletion
    # -------------------------------------------------------------------------

    @property
    def delete_form(self) -> str:
        """
        Returns HTML for a form to delete this file.
        """
        if not self.req:
            return ""
        dest_url = self.req.route_url(Routes.DELETE_FILE)
        form = UserDownloadDeleteForm(request=self.req, action=dest_url)
        appstruct = {ViewParam.FILENAME: self.filename}
        rendered_form = form.render(appstruct)
        return rendered_form

    def delete(self) -> None:
        """
        Deletes the file. Does not raise an exception if the file does not
        exist.
        """
        try:
            os.remove(self.fullpath)
            log.info(f"Deleted file: {self.fullpath}")
        except OSError:
            pass

    # -------------------------------------------------------------------------
    # Downloading
    # -------------------------------------------------------------------------

    @property
    def download_url(self) -> str:
        """
        Returns a URL to download this file.
        """
        if not self.req:
            return ""
        querydict = {ViewParam.FILENAME: self.filename}
        return self.req.route_url(Routes.DOWNLOAD_FILE, _query=querydict)

    @property
    def contents(self) -> Optional[bytes]:
        """
        The file contents, or ``None`` if the file does not exist. May raise
        :exc:`OSError` if the read fails.
        """
        if not self.exists:
            return None
        with open(self.fullpath, "rb") as f:
            return f.read()

    # -------------------------------------------------------------------------
    # Bulk creation
    # -------------------------------------------------------------------------

    @classmethod
    def from_directory_scan(
        cls,
        directory: str,
        permitted_lifespan_min: float = 0,
        req: "CamcopsRequest" = None,
    ) -> List["UserDownloadFile"]:
        """
        Scans the directory and returns a list of :class:`UserDownloadFile`
        objects, one for each file in the directory.

        For each object, ``directory`` is the root directory (our parameter
        here), and ``filename`` is the filename RELATIVE to that.

        Args:
            directory: directory to scan
            permitted_lifespan_min: lifespan for each file
            req: a :class:`camcops_server.cc_modules.cc_request.CamcopsRequest`
        """
        results = []  # type: List[UserDownloadFile]
        # Imagine directory == "/etc":
        for root, dirs, files in os.walk(directory):
            # ... then root might at times be "/etc/apache2"
            for f in files:
                fullpath = os.path.join(root, f)
                relative_filename = relative_filename_within_dir(
                    fullpath, directory
                )
                results.append(
                    UserDownloadFile(
                        filename=relative_filename,
                        directory=directory,
                        permitted_lifespan_min=permitted_lifespan_min,
                        req=req,
                    )
                )
        return results