by_get package

Blue Yonder coding task.

by_get module

Blue Yonder coding task for Python Developer position.

Given a plaintext file containing URLs, one per line, e.g.:

http://mywebserver.com/images/271947.jpg
http://mywebserver.com/images/24174.jpg
http://somewebsrv.com/img/992147.jpg

Write a script that takes this plaintext file as an argument and downloads all images, storing them on the local hard disk. Approach the problem as you would any task in a normal day’s work. Imagine this code will be used in important live systems, modified later on by other developers, and so on.

Please use the Python programming language for your solution. We prefer to receive your code in GitHub or a similar repository.

get_url(url, session=None, filter_html=True, **requests_kwargs)

Retrieve server response from given URL as byte stream.

Attempts to filter out non-image responses like HTML pages by default. Subsequent requests to the same server will re-use a connection of the given session if possible.

Parameters:
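  • url (str, required) – A valid URL for an image resource.
  • session (optional) – Session whose connections are re-used for subsequent requests to the same server.
  • filter_html (bool, optional) – Whether to filter out HTML and other text-based responses.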
  • requests_kwargs (optional) – Additional keyword arguments to be passed on to the requests module.
Returns: A Response object of the requests module.

Raises:
  • TypeError – Response is not an image.
  • HTTPError – Server responds with an HTTP status code indicating an error.
  • InvalidSchema – The URL is invalid and cannot be requested.
  • MissingSchema – A valid protocol schema, e.g. 'http://', is missing from the URL.
  • ReadTimeout – Server did not respond within 10 seconds.
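
The documentation does not say how non-image responses are detected. One plausible approach, shown here as a minimal sketch only, is to inspect the Content-Type header of the response. The helper name ensure_image is hypothetical and not part of by_get; the sketch uses only the standard library so it runs without a network connection.

```python
def ensure_image(content_type):
    """Return the media type if it names an image, else raise TypeError.

    Hypothetical helper: by_get may detect non-image responses differently.
    """
    # Header values may carry parameters, e.g. "text/html; charset=utf-8",
    # and a missing header arrives as None.
    media_type = (content_type or "").split(";", 1)[0].strip().lower()
    if not media_type.startswith("image/"):
        raise TypeError("Response is not an image: %r" % media_type)
    return media_type


print(ensure_image("image/jpeg"))
try:
    ensure_image("text/html; charset=utf-8")
except TypeError as error:
    print("rejected:", error)
```

Raising TypeError here mirrors the exception listed above, so a caller can treat a parked-domain HTML page the same way as any other bad URL.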
hash_string(some_string)

Calculate the human-readable SHA256 hash of a string.

Parameters: some_string (str, required) – String to be hashed.
Returns: String of 64 hexadecimal characters.
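
A function with this contract could be written with the standard hashlib module, as in the sketch below. Encoding the string as UTF-8 before hashing is an assumption; the documentation does not state which encoding by_get uses.

```python
import hashlib


def hash_string(some_string):
    """Return the SHA256 digest of a string as 64 hexadecimal characters."""
    # Assumption: the string is encoded as UTF-8 before hashing.
    return hashlib.sha256(some_string.encode("utf-8")).hexdigest()


digest = hash_string("http://mywebserver.com/images/271947.jpg")
print(len(digest))  # 64
```

Because the digest depends on the whole URL, two images that share a file name on different servers map to different local names.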
main(argv)

Download images from a given plaintext file of URLs.

Good URLs and the resulting image file names are written to STDOUT. Bad URLs and their error codes are written to STDERR. Images are saved to the working directory and named with the SHA256 hash of the (sanitized) source URL, so that identically named files from different URLs do not overwrite each other.

Redirects are handled transparently and non-image responses (e.g. HTML from domain parking servers) are filtered out. Existing connections are re-used where possible to reduce the overhead of connection negotiation when repeatedly requesting images from the same server.
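
The reporting and naming behaviour described above can be sketched as follows. This is an illustration under stated assumptions, not the actual body of main: download_all, fake_fetch, and the out_dir parameter are hypothetical names, the HTTP request is replaced by a stand-in so the example runs offline, and the tab-separated output format is a guess.

```python
import hashlib
import os
import sys
import tempfile


def download_all(urls, fetch, out_dir="."):
    """Save each fetched image under the SHA256 hash of its URL.

    `fetch` is a hypothetical callable that returns image bytes or raises.
    Returns the number of URLs that failed.
    """
    failures = 0
    for url in urls:
        try:
            content = fetch(url)
        except Exception as error:  # broad on purpose: report and carry on
            print("%s\t%s" % (url, error), file=sys.stderr)
            failures += 1
            continue
        file_name = hashlib.sha256(url.encode("utf-8")).hexdigest()
        with open(os.path.join(out_dir, file_name), "wb") as image_file:
            image_file.write(content)
        print("%s\t%s" % (url, file_name))
    return failures


def fake_fetch(url):
    """Stand-in for a real HTTP request so the sketch runs offline."""
    if not url.startswith("http://"):
        raise ValueError("missing schema")
    return b"image bytes"


with tempfile.TemporaryDirectory() as target:
    failed = download_all(
        ["http://mywebserver.com/images/271947.jpg", "mywebserver.com/x.jpg"],
        fake_fetch,
        out_dir=target,
    )
    print("failed:", failed)
```

Keeping good and bad URLs on separate streams lets a caller redirect STDERR to a file and retry only the failures later.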

sanitize_urls(file_object)

Basic input sanitization that filters out whitespace, empty lines, and unsafe characters.

Invalid URLs and missing or wrong schemata are not handled here; they are left to the Requests module.

Parameters: file_object (file, required) – An opened file object for a text file with one line per URL.
Yields: Lazy iterator over sanitized URLs.
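
A generator along these lines would satisfy the description. It is a sketch only: the documentation does not define which characters count as unsafe, so percent-encoding everything outside the reserved and unreserved URL characters with urllib.parse.quote is an assumption.

```python
import io
from urllib.parse import quote

# Assumption: reserved URL characters and '%' are left untouched, so
# existing percent-escapes are not encoded a second time.
SAFE_CHARACTERS = ":/?#[]@!$&'()*+,;=%"


def sanitize_urls(file_object):
    """Lazily yield cleaned URLs from a file with one URL per line."""
    for line in file_object:
        url = line.strip()  # drop surrounding whitespace and the newline
        if not url:
            continue  # skip empty lines
        yield quote(url, safe=SAFE_CHARACTERS)


sample = io.StringIO("  http://mywebserver.com/images/271947.jpg \n\n")
for clean_url in sanitize_urls(sample):
    print(clean_url)
```

Since the function is a generator, a large URL file is processed line by line rather than being loaded into memory at once.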

version module

Program version.

Follows the semantic versioning style as described at http://semver.org/.
