.. _plugin_arch:

Plugin Architecture
===================

Abstract:
---------

The purpose of plugins is to create a decoupled architecture such that ScanCode can
support extensibility at different stages of a scan. These stages are:

* Pre-scan: plugins that run before the scan proper, for example to extract
  different archive types, to decide how certain types of files are handled, or
  to collect file types. These plugins process a whole codebase at once.

* Scan proper: plugins that scan a file, e.g. to collect data and evidence from
  the file. These plugins process one file at a time and can do a whole codebase
  pass once all files are scanned.

* Post-scan: plugins that run after the scan, e.g. for summarization and other
  aggregate operations once all scans are completed. These plugins process a
  whole codebase at once.

* Output and output filter: plugins for output creation and filtering such as
  formatting or converting output to other formats (such as json, spdx, csv,
  yaml). These plugins process a whole codebase at once.


Description:
------------

This project aims at making ScanCode a “pluggable” system, where new
functionality can be added to ScanCode at runtime as “plugins”. These plugins
are hooked into ScanCode through predefined hooks. The proposed plugin
management system is pluggy.

Why pluggy?
^^^^^^^^^^^

Pluggy is well documented and regularly maintained, and has proved its worth in
projects such as pytest. Pluggy relies on hook specifications and hook
implementations (callbacks) instead of the conventional subclassing approach,
which may encourage tight coupling to the framework. A hook specification
contains only method signatures (no code) and is defined by the application. A
hook implementation, provided by a plugin, contains the definitions of the
methods declared in the corresponding hook specification.
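
The split between specification and implementation can be sketched with pluggy
as follows. This is a minimal illustration and not ScanCode's actual code: the
project name ``scancode``, the ``PreScanSpec`` and ``ZipPlugin`` classes and the
``path`` argument are assumptions made for the example.

.. code-block:: python

   import pluggy

   # Markers must share the project name used by the PluginManager.
   hookspec = pluggy.HookspecMarker("scancode")
   hookimpl = pluggy.HookimplMarker("scancode")


   class PreScanSpec:
       """Hook specification: signatures only, defined by the application."""

       @hookspec
       def extract_archive(self, path):
           """Extract the archive at ``path``."""


   class ZipPlugin:
       """Hook implementation: provided by a (hypothetical) plugin."""

       @hookimpl
       def extract_archive(self, path):
           if path.endswith(".zip"):
               return "would extract " + path
           return None  # pluggy leaves None out of the result list


   manager = pluggy.PluginManager("scancode")
   manager.add_hookspecs(PreScanSpec)
   manager.register(ZipPlugin())

   # Hooks are called with keyword arguments; results come back as a list.
   results = manager.hook.extract_archive(path="samples/archive.zip")

The application only knows the specification. Any number of plugins can
implement the same hook, and each non-``None`` return value becomes one entry
in ``results``.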

The plugin architecture will start with three hook specifications (more can be
added if required). Output creation and filtering, listed as a separate stage in
the abstract, is covered here by the post-scan hook.

1. Pre-scan hook
^^^^^^^^^^^^^^^^

- **Structure**

::

   from pluggy import HookspecMarker

   prescan_hookspec = HookspecMarker('prescan')

   @prescan_hookspec
   def extract_archive(args):
       """Extract the archive whose path is passed as an argument."""

The path of the archive to be extracted is passed as an argument to the
extract_archive function, which is called before the scan, at extraction time.
The function determines the archive type and extracts the contents accordingly.
It can also be called again for any archive found inside the scanned tree, so
that nested archives are extracted too.
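
The recursive extraction described above could look like the following sketch.
It is not ScanCode's extraction code: it handles zip archives only, and the
``target_dir`` argument and the ``-extract`` suffix for nested archives are
assumptions made for the example.

.. code-block:: python

   import os
   import zipfile


   def extract_archive(path, target_dir):
       """Extract the zip archive at ``path``, including nested zip archives.

       Return the paths of the extracted files that are not archives.
       """
       extracted = []
       with zipfile.ZipFile(path) as archive:
           archive.extractall(target_dir)
           names = archive.namelist()

       for name in names:
           member = os.path.join(target_dir, name)
           if os.path.isdir(member):
               continue
           if zipfile.is_zipfile(member):
               # Nested archive: call the same function again on it.
               extracted.extend(extract_archive(member, member + "-extract"))
           else:
               extracted.append(member)
       return extracted

A plugin for another archive type would follow the same shape and only replace
the ``zipfile`` calls.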

2. Scan proper hook
^^^^^^^^^^^^^^^^^^^


- **Structure**

::

   from pluggy import HookspecMarker

   scanproper_hookspec = HookspecMarker('scanproper')

   @scanproper_hookspec
   def add_cmdline_option(args):
       """Return the click details of a new command line option."""

This function will be called before starting the scan, without any arguments. It
will return a dict containing the click extension details and possibly some help
text. If the user supplies this option, the call will be routed to the callback
defined by the click extension. For instance, if a plugin adds support for regex
as a valid ignore pattern, this function will return a dict such as::

   {
       'name': '--ignore-regex',
       'options': {
           'default': None,
           'multiple': True,
           'metavar': '<pattern>'
       },
       'help': 'Ignore files matching regex <pattern>',
       'call_after': 'is_ignored'
   }

According to the above dict, if the option --ignore-regex is supplied, this function will be called
after the is_ignored function and the data returned by the is_ignored function will be supplied to
this function as its argument(s). So if the program flow was::

   scancode() ⇔ scan() ⇔ resource_paths() ⇔ is_ignored()


it will now become

::

   scancode() ⇔ scan() ⇔ resource_paths() ⇔ is_ignored() ⇔ add_cmdline_option()


Options such as **call_after, call_before, call_first, call_last** can be defined to determine
when the function is to be executed.
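
One way to picture the ``call_after`` placement is a list of named steps into
which the plugin callback is inserted. The sketch below is illustrative only:
``insert_step``, ``run`` and ``make_regex_filter`` are hypothetical helpers, and
the ``is_ignored`` shown here is a stand-in for the core function, not its real
implementation.

.. code-block:: python

   import re


   def is_ignored(paths):
       """Stand-in for the core step: drop version control files."""
       return [p for p in paths if not p.startswith(".git/")]


   def make_regex_filter(patterns):
       """Build the plugin callback from the --ignore-regex values."""
       compiled = [re.compile(pattern) for pattern in patterns]

       def ignore_regex(paths):
           return [p for p in paths if not any(c.search(p) for c in compiled)]

       return ignore_regex


   def insert_step(pipeline, name, func, call_after):
       """Return a new pipeline with (name, func) right after ``call_after``."""
       names = [step_name for step_name, _ in pipeline]
       position = names.index(call_after) + 1
       return pipeline[:position] + [(name, func)] + pipeline[position:]


   def run(pipeline, data):
       """Feed the output of each step to the next one."""
       for _name, func in pipeline:
           data = func(data)
       return data


   pipeline = [("is_ignored", is_ignored)]
   pipeline = insert_step(
       pipeline, "ignore_regex", make_regex_filter([r"\.min\.js$"]),
       call_after="is_ignored",
   )
   remaining = run(pipeline, [".git/config", "src/app.js", "src/app.min.js"])

Here the plugin step receives the paths that survived ``is_ignored`` and removes
those matching the user-supplied pattern. ``call_before``, ``call_first`` and
``call_last`` would only change how ``position`` is computed.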

::

   @scanproper_hookspec
   def dependency_scan(args):
       """Declare file patterns of interest, then process each matching file."""

This function will first be called before starting the scan, without any
arguments. It will return a list of file types or attributes. Whenever one of
these is encountered in the scanned tree, the function will be called again with
the path to the file as an argument. It can then do some extra processing on the
file and return data to be used as a dependency by the normal scanning process.
E.g. it can return a list such as::

   [ 'debian/copyright' ]

Whenever a file matches this pattern, this function will be called and the data returned will be
supplied to the main scancode function.
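
The two call modes can be sketched as follows. The matching rule, the returned
data and the ``collect_dependency_data`` driver are assumptions made for the
example; the proposal does not define them.

.. code-block:: python

   def dependency_scan(path=None):
       """Hypothetical plugin hook with the two call modes described above."""
       if path is None:
           # First call, before the scan: declare the patterns of interest.
           return ["debian/copyright"]
       # Later calls, one per matching file: return extra data for the scan.
       return {"path": path, "kind": "debian-copyright"}


   def matches(path, pattern):
       """True if the "/"-separated path is the pattern or ends with it."""
       return path == pattern or path.endswith("/" + pattern)


   def collect_dependency_data(paths, hook=dependency_scan):
       """Call the hook for every scanned path matching one of its patterns."""
       patterns = hook()
       return [
           hook(path)
           for path in paths
           if any(matches(path, pattern) for pattern in patterns)
       ]


   data = collect_dependency_data(
       ["README", "pkg/debian/copyright", "notdebian/copyright"]
   )

Only ``pkg/debian/copyright`` triggers a second call to the hook in this
example, because the pattern must match whole path segments.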

3. Post-scan hook
^^^^^^^^^^^^^^^^^


- **Structure**

::

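   from pluggy import HookspecMarker

   postscan_hookspec = HookspecMarker('postscan')

   @postscan_hookspec
   def format_output(args):
       """Convert the scan data and store it at the given output path."""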

This function will be called after a scan is finished. It will be supplied with the path to the ABC
data generated from the scan, the path to the root of the scanned code and a path where the output
is expected to be stored. The function will store the processed data in the output path supplied.
This can be used to convert output to other formats such as CSV, SPDX, JSON, etc.
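
As an illustration, a CSV formatter taking these three paths could look like the
sketch below. The layout of the scan data (a JSON file with a ``files`` list
whose entries have ``path`` and ``license`` keys) is an assumption made for the
example and not the actual format of the scan data.

.. code-block:: python

   import csv
   import json
   import os


   def format_output(scan_data_path, scan_root, output_path):
       """Convert scan data to CSV and store it at ``output_path``."""
       with open(scan_data_path) as scan_file:
           files = json.load(scan_file)["files"]

       # newline="" lets the csv module control line endings itself.
       with open(output_path, "w", newline="") as output_file:
           writer = csv.writer(output_file)
           writer.writerow(["path", "license"])
           for entry in files:
               # Report paths relative to the root of the scanned code.
               relative_path = os.path.relpath(entry["path"], scan_root)
               writer.writerow([relative_path, entry.get("license", "")])

The function returns nothing: as described above, its result is the file
written at the supplied output path.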

::

   @postscan_hookspec
   def summarize_output(args):
       """Return a summary of the scan data to report to the user."""

This function will be called after a scan is finished. It will be supplied with the data to be
reported to the user as well as a path to the root of the scanned node. The data returned can then
be reported to the user. This can be used to summarize output, for example to encapsulate the data
to be reported, to omit similar file metadata, or to classify files as tests, code proper,
licenses, readme, configs, build scripts, etc.
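
The classification use case could be sketched as follows. The rules in
``classify`` and the shape of ``data`` (a list of dicts with a ``path`` key
relative to the scan root, using ``/`` as separator) are assumptions made for
the example.

.. code-block:: python

   from collections import Counter
   import posixpath


   def classify(path):
       """Assign a category to a "/"-separated path (hypothetical rules)."""
       name = posixpath.basename(path).lower()
       directories = path.lower().split("/")[:-1]
       if "tests" in directories or name.startswith("test_"):
           return "tests"
       if name.startswith(("license", "copying")):
           return "licenses"
       if name.startswith("readme"):
           return "readme"
       if name.endswith((".cfg", ".ini", ".yml", ".yaml", ".toml")):
           return "configs"
       if name in ("makefile", "setup.py", "configure"):
           return "build scripts"
       return "code proper"


   def summarize_output(data, scan_root):
       """Return per-category file counts for the scanned files."""
       counts = Counter(classify(entry["path"]) for entry in data)
       return {"root": scan_root, "counts": dict(counts)}

The returned dict is what would be reported to the user in place of, or in
addition to, the per-file data.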


- **Identifying or configuring plugins**

For Python plugins, pluggy supports loading modules from setuptools entry points,
e.g.

::

       entry_points = {
           'scancode_plugins': [
               'name_of_plugin = ignore_regex',
           ]
       }

This plugin can be loaded using the PluginManager class’s
load_setuptools_entrypoints('scancode_plugins') method, which returns the number of plugins loaded.

Non-Python plugins will all be stored in a common directory, and each of them will have a manifest
configuration in YAML format. This directory will be scanned at startup for plugins. After parsing
the config file of a plugin, the data will be supplied to the plugin manager as if it were supplied
using setuptools entry points.

Non-Python plugin executables will be spawned in their own processes. According to their config
data, they will be passed arguments and will return data as necessary. In addition, the desired
hook function of a non-Python plugin can be selected through command line arguments, which again
are mapped in the config file.
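
The process-based calling convention could be sketched as follows. Everything
here is an assumption made for illustration: the manifest is shown as an already
parsed dict, the plugin replies with JSON on its standard output, and a small
inline Python script stands in for the plugin executable, which in practice
could be written in any language.

.. code-block:: python

   import json
   import subprocess
   import sys

   # Stand-in for a plugin executable: answers the "-aco" flag with JSON.
   PLUGIN_SOURCE = r"""
   import json, sys
   if sys.argv[1] == "-aco":
       print(json.dumps({"name": "--ignore-regex", "call_after": "is_ignored"}))
   """

   # Parsed manifest, reduced to the fields used in this sketch.
   manifest = {
       "name": "ignore_regex",
       "hook": "scanproper",
       "hookfunctions": {"add_cmdline_option": "-aco", "dependency_scan": "-dc"},
   }


   def call_plugin_hook(command, manifest, hook_function):
       """Run the plugin in its own process and decode its JSON reply."""
       flag = manifest["hookfunctions"][hook_function]
       completed = subprocess.run(
           command + [flag], capture_output=True, text=True, check=True
       )
       return json.loads(completed.stdout)


   command = [sys.executable, "-c", PLUGIN_SOURCE]
   option = call_plugin_hook(command, manifest, "add_cmdline_option")

The manifest maps each hook function to the argument that selects it, so the
core never needs to know how the plugin is implemented.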

Sample config file for an ignore_regex plugin calling the scanproper hook would be::

   name: ignore_regex
   hook: scanproper
   hookfunctions:
     add_cmdline_option: '-aco'
     dependency_scan: '-dc'
   data:
     add_cmdline_option:
       name: '--ignore-regex'
       options:
         default: null
         multiple: true
         metavar: '<pattern>'
       help: 'Ignore files matching regex <pattern>'
       call_after: 'is_ignored'

Existing solutions:
-------------------

An alternative to a “pluggable” system would be the more conventional approach of adding
functionality directly to the core codebase, which removes the abstraction layer provided by
a plugin management and hook calling system.
