Defines the execution of genconfig
The genconfig command depends on predefined Jinja2 templates for the skeleton configuration files. Taking the –type argument from the CLI input, the corresponding template file is used.
Settings for the configuration file, like project name, selector type and URL are taken from the CLI input and using these as parameters, the template is rendered. This rendered JSON document is saved as <project_name>.json.
Defines the execution of generate
The generate command uses Jinja2 templates to create Python scripts, according to the specification in the configuration file. The predefined templates use the extract_content() method of the selector classes to implement linear extractors and use recursive for loops to implement multiple levels of link crawlers. This implementation is effectively a representation of the traverse_next() utility function, using the loop depth to differentiate between levels of the crawler execution.
According to the –output_type argument in the CLI input, the results are written into a JSON document or a CSV document.
The Python script is written into <output_filename>.py - running this file is the equivalent of using the Scrapple run command.
Defines the execution of run
The run command implements the web content extractor corresponding to the given configuration file.
The execute_command() validates the input project name and opens the JSON configuration file. The run() method handles the execution of the extractor run.
The extractor implementation follows these primary steps :
Defines the execution of web
The web command runs the Scrapple web interface through a simple Flask app.
When the execute_command() method is called from the runCLI() function, it starts of two simultaneous processes :
The ‘/’ view of the Flask app, opens up the Scrapple web interface. This provides a basic form, to fill in the required configuration file. On submitting the form, it makes a POST request, passing in the form in the request header. This form is passed to the form_to_json() utility function, where the form is converted into the resultant JSON configuration file.
Currently, closing the web command execution requires making a keyboard interrupt on the command line after the web interface has been closed.