.. _ref_API:

============================
Module Functions and Objects
============================

.. module:: jsre
.. moduleauthor:: Howard Chivers

This section provides interface information for *jsre* module functions, classes and methods.
See :ref:`ref_examples` for an overview with examples.

The module uses standard Python exceptions, notably ``SyntaxError`` for errors in the regular expression pattern.
If logging is enabled, syntax errors are also logged in a more helpful form that marks the position of the
error in the regular expression. For example::

	>>> pattern = r'abc(?)def'
	>>> jsre.compile(pattern)
        
	parser       ERROR    abc(?)def
	parser       ERROR        ^
	parser       ERROR    Syntax error - RE or group starts with a repeat specification (nothing to repeat) at 4

The module functions provide simple shortcuts for the more comprehensive :class:`RegexObject` methods. Module functions
cache compiled regular expressions, so repeated calls via module functions avoid the compilation overhead.
However, the :class:`RegexObject` methods provide richer functionality and allow better control of
which objects are retained in memory.
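
The trade-off between cached module functions and explicitly held objects can be sketched with the standard ``re`` module standing in for *jsre* (an assumption, so that the example runs without *jsre* installed). The ``cached_compile`` helper and the local ``search`` and ``purge`` functions are hypothetical illustrations of a compile cache, not the *jsre* implementation.

```python
import functools
import re


@functools.lru_cache(maxsize=128)
def cached_compile(pattern, flags=0):
    """Hypothetical helper: compile once per (pattern, flags) pair."""
    return re.compile(pattern, flags)


def search(pattern, target, flags=0):
    """Module-style shortcut that reuses a cached compiled object."""
    return cached_compile(pattern, flags).search(target)


def purge():
    """Drop every cached object, in the spirit of a module level purge()."""
    cached_compile.cache_clear()


search(r"a+b", "xaab")          # first call compiles and caches
search(r"a+b", "aab")           # second call reuses the cached object
hits = cached_compile.cache_info().hits
purge()
size_after_purge = cached_compile.cache_info().currsize
```

Holding the compiled object directly avoids the cache lookup and leaves its lifetime under the caller's control.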

*jsre* also provides a :class:`ReCompiler` class, which offers a more comprehensive set of compilation
functions than the module level ``compile()``. This class allows several encodings and patterns
to be combined into a single :class:`RegexObject` instance.

.. _ref_api_module:

----------------
Module Functions
----------------

The module level functions ``search()``, ``match()``, ``finditer()`` and ``findall()`` allow the search target
(the data to be searched) to be either bytes or str. If a string target is presented, any resulting :class:`Match`
instances will index the original string.

(Note - matching patterns in string targets is currently presented as a compatibility feature. Strings presented
to module functions are encoded into UTF-32 before matching and this may incur a speed overhead.)
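
The index correction for string targets follows from UTF-32 using exactly four bytes per code point. The following is a minimal sketch using only the standard library; the helper name ``byte_to_str_index`` is hypothetical and this is not the *jsre* code.

```python
def byte_to_str_index(byte_index):
    """Map a byte offset in a UTF-32 buffer to an index in the source string."""
    # UTF-32 is fixed width: every code point occupies 4 bytes.
    return byte_index // 4


text = "caf\u00e9 \U0001F600 ok"
buf = text.encode("utf-32-be")      # the -be codec writes no byte-order mark

needle = "ok".encode("utf-32-be")
# Step through the buffer 4 bytes at a time so only aligned positions are tested.
byte_pos = next(i for i in range(0, len(buf), 4) if buf.startswith(needle, i))
str_pos = byte_to_str_index(byte_pos)
```

The fixed width makes the mapping a simple division, at the price of encoding the whole target before matching.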

In *jsre* all patterns are strings, regardless of the type of the search target.

The following constants may be used as flags in both module and object functions:

.. data:: 	I
			IGNORECASE

	Matching is to be case insensitive; for example 'a' will match both 'a' and 'A'. 
	Full Unicode case folding is supported.

.. data::	M
			MULTILINE

	The special characters ``^`` and ``$`` match at the beginning and end of a line respectively 
	(after/before newline characters). The default is for these characters to match only at the 
	beginning and end of the input buffer. 

.. data::   S
            DOTALL

	The special character ``.`` matches any valid codepoint (character). The default behaviour 
	is for ``.`` to match all codepoints except newline characters.

..	data::	X
			VERBOSE

	Allows regular expressions to be formatted so that they are more readable.

	Whitespace within a regular expression is ignored, except where it is in a character class
	or preceded by a backslash. Text between ``#`` and the end of the line is also ignored (i.e. it is
	a comment), again provided that ``#`` is unescaped and not within a character class.
	Whitespace is not allowed between the start of a group and any extension syntax;
	for example ``(  ?  P  <name>)`` would not parse, but ``(?P< name >)`` is accepted.
	Similarly, whitespace is not allowed between a backslash and the following character
	(e.g. ``\w``) or within code point specifications.
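
The four flags above share their names with flags in the standard ``re`` module. As a hedged illustration, the sketch below uses ``re`` as a stand-in (*jsre* itself is not imported) and assumes the same-named flags behave alike on these simple inputs.

```python
import re

# IGNORECASE: 'a' matches both 'a' and 'A'.
ignorecase_hits = re.findall(r"a", "aA", re.I)

# MULTILINE: '^' also matches immediately after each newline.
line_starts = re.findall(r"^\w+", "one\ntwo", re.M)

# DOTALL: '.' also matches the newline character.
dotall_match = re.search(r"a.b", "a\nb", re.S)
default_match = re.search(r"a.b", "a\nb")

# VERBOSE: unescaped whitespace and '#' comments are ignored.
number = re.compile(r"""
    \d+          # integer part
    (?:\.\d+)?   # optional fraction
""", re.X)

# Flags combine with bitwise OR.
combined = re.findall(r"^b", "a\nB", re.I | re.M)
```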

.. data::	INDEXALT

	Specifies that alternatives are indexed. This allows (sub)expressions specified as alternatives 
	in the regular expression to be retrieved from a :class:`Match` object. This is much more efficient 
	(and much more scalable) than using submatch groups to identify individual alternatives in a 
	big list. See :ref:`ref_example_keyword` for an example.
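
For contrast, the submatch-group technique that ``INDEXALT`` is described as improving on can be sketched with the standard ``re`` module. Each keyword gets its own named group and ``lastgroup`` reports which one matched. The group names ``k0``, ``k1``, ``k2``, the keyword list and the ``matched_keyword`` helper are hypothetical.

```python
import re

keywords = ["alpha", "beta", "gamma"]

# One named group per alternative: (?P<k0>alpha)|(?P<k1>beta)|(?P<k2>gamma)
pattern = re.compile(
    "|".join(f"(?P<k{i}>{re.escape(word)})" for i, word in enumerate(keywords)),
    re.IGNORECASE,
)


def matched_keyword(match):
    """Recover the keyword as written in the pattern, not as found in the text."""
    return keywords[int(match.lastgroup[1:])]


found = [(m.group(), matched_keyword(m)) for m in pattern.finditer("Beta then GAMMA")]
```

This approach needs one capture group per keyword, which is the scaling cost that indexed alternatives are intended to avoid.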

.. data::	SECTOR

	This enables the stride and offset of anchor positions within the search target to be
	specified, for example to search only at disk sector boundaries. See
	:ref:`ref_example_sector` for an example.
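
The idea of stride and offset can be illustrated without *jsre*, on the assumption that anchors are tested only at ``offset``, ``offset + stride``, ``offset + 2 * stride`` and so on. The sketch below uses the standard ``re`` module and a hypothetical ``sector_search`` helper. It is a conceptual model, not how *jsre* implements the flag.

```python
import re


def sector_search(pattern, buf, stride, offset=0):
    """Yield matches that begin exactly on a stride boundary (hypothetical helper)."""
    compiled = re.compile(pattern)
    for anchor in range(offset, len(buf), stride):
        # Pattern.match(buf, pos) only succeeds if a match starts at pos.
        m = compiled.match(buf, anchor)
        if m:
            yield m


SECTOR_SIZE = 512
disk = bytearray(SECTOR_SIZE * 4)
disk[1024:1028] = b"%PDF"      # signature at the start of sector 2
disk[700:704] = b"%PDF"        # same bytes, but not on a sector boundary

starts = [m.start() for m in sector_search(rb"%PDF", bytes(disk), SECTOR_SIZE)]
```

Only the signature on the sector boundary is reported, and far fewer anchor positions are tested than in an unrestricted search.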

**Module level functions are:**

.. function::	compile(pattern[, flags])

	Compile the regular expression *pattern* and return a :class:`RegexObject` object which 
	allows searching etc. using the methods below. Note that the :class:`ReCompiler` class 
	may be used to compile combinations of expressions and encodings into a single matching object.

.. function::	search(pattern, target[, flags])

	Search through the *target* (string or bytes) to find the first match of *pattern*,
	and return the corresponding :class:`Match` instance. Returns ``None`` if
	no match is found. The *pattern* must be a string, regardless of the type of *target*;
	if the *target* is a string then it is encoded using utf-32-be before matching;
	if it is a bytes object then the default encoding (utf-8) is assumed. If different encodings
	are required the ``RegexObject`` methods provide a much wider range of options.

.. function::	match(pattern, target[, flags])

	Attempt to match the *pattern* starting at the first character in the target. In other words the 
	function is the same as ``search()`` but only succeeds if there is a match at the start of 
	the *target* string or buffer.

.. function::	findall(pattern, target[, flags])

	Returns all non-overlapping matches of *pattern* in the *target* as a list of strings,
	or a list of tuples. If the *pattern* has sub-match groups then each result will be a
	tuple in which the first value is the overall match and subsequent values are
	the groups defined in the regular expression. Groups that did not match are returned as
	``None`` entries in the tuple.
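
The result shape described here differs from the standard ``re.findall``, which omits the overall match when groups are present. As a sketch of the documented shape only, the example below builds it with ``re`` as a stand-in; ``findall_with_whole`` is a hypothetical name.

```python
import re


def findall_with_whole(pattern, target):
    """Return (whole match, group 1, group 2, ...) for every match."""
    compiled = re.compile(pattern)
    if compiled.groups == 0:
        # No sub-match groups: a plain list of matched strings.
        return [m.group(0) for m in compiled.finditer(target)]
    # Groups that did not take part in the match are reported as None.
    return [(m.group(0),) + m.groups() for m in compiled.finditer(target)]


plain = findall_with_whole(r"\d+", "10 apples, 20 pears")
grouped = findall_with_whole(r"(\d+)(st|nd|rd|th)?", "1st 22 3rd")
```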

.. function::	finditer(pattern, target[, flags])

	Returns an *iterator* of :class:`Match` instances over non-overlapping 
	matches of *pattern* in the given *target*. 

.. function::	purge()

	Clears the regular expression cache.

.. _ref_api_compiler:

---------------------------
Regular Expression Compiler
---------------------------

.. class::	ReCompiler(pattern=None, encoding=('utf_8',), flags=0, offset=0, stride=0)

	The :class:`ReCompiler` class may be used as an alternative to the module level ``compile()`` 
	function to allow combinations of expressions, flags and encodings to be compiled into a 
	single :class:`RegexObject` matching engine. 
	
	The class allows a search specification (flags, encodings) to be set and then one or 
	more patterns to be associated with the current search specification. The specification 
	may be changed using ``update(...)`` and more patterns added to the existing specification
	using ``setPattern(...)``. Finally the :class:`RegexObject` is obtained via the ``compile()``
	method. See examples in :ref:`ref_example_compile`. 
	
	The :class:`Match` object resulting from a successful match has attributes of ``re`` and ``encoding``
	which document the pattern and encoding that resulted in the match.  

	.. attribute:: encoding
	
	If present, this is a list or tuple of encodings; all the standard Python codecs are supported.
	The most common encodings are installed by default and others can be added if required. 
	See :ref:`ref_install_encoding` for more detail.
	
	.. attribute:: flags
	
	Flags from those defined for this module (see above); multiple flags may be combined by
	addition (``+``) or bitwise OR (``|``).
	
	.. attribute:: offset, stride
	
	If the ``SECTOR`` flag is set then integer values for offset and stride may be specified to 
	control the allowed positions of matching anchors (the places from which a match is tested). 
	See examples at :ref:`ref_example_sector`.
	
	.. attribute:: pattern
	
	The pattern is a regular expression presented as a string. (It is always a string, regardless 
	of what encoding or set of encodings are to be searched.) When a pattern is registered it 
	is compiled using the current search specification.
	
	The :class:`ReCompiler` class provides the following methods:
	
		.. method:: update(encoding=None, flags=None, offset=0, stride=0)
		
		This method allows the current search specification to be updated; the new specification
		will apply to any patterns registered after the update, but not to any that are already
		registered. *offset* and *stride* are only required if the ``SECTOR`` flag is set, and the
		*encoding* and *flags* arguments are as described above.
		
		If one of *encoding* or *flags* is not specified, or is ``None``, it will not be updated.
		
		.. method:: setPattern(pattern)
		
		Registers a pattern with the compiler. The pattern is a regular expression which will be compiled
		using the current search specification. The pattern is added to those already specified as an
		independent regular expression.
		
		.. method:: compile()
		
		Returns a :class:`RegexObject` instance compiled from the various search specifications and 
		patterns previously specified.
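
To show why one matching object may need several encodings, the sketch below searches a byte buffer for the same keyword under two encodings. It uses the standard ``re`` module on pre-encoded literals, and ``multi_encoding_search`` is a hypothetical helper; it says nothing about how *jsre* works internally. Because each encoding is checked in turn, the first hit reported is not necessarily the earliest position in the buffer.

```python
import re


def multi_encoding_search(keyword, buf, encodings=("utf_8", "utf_16_le")):
    """Return (encoding, start, end) for the first encoding that matches."""
    for encoding in encodings:
        # Encode the literal keyword, then escape it for use as a bytes pattern.
        needle = re.escape(keyword.encode(encoding))
        m = re.search(needle, buf)
        if m:
            return encoding, m.start(), m.end()
    return None


# The UTF-16 copy comes first in the buffer, the UTF-8 copy second.
buf = "secret".encode("utf_16_le") + b" ... " + "secret".encode("utf_8")
hit = multi_encoding_search("secret", buf)
```

The UTF-8 hit at byte 17 is reported even though a UTF-16 match exists at byte 0, because UTF-8 is tried first.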

.. _ref_api_regex:

--------------------------
Regular Expression Objects
--------------------------

.. class::	RegexObject

    The :class:`RegexObject` class encapsulates a compiled matching engine and provides 
    methods that allow the engine to be used to match patterns in buffers.
    
    This pattern matcher is designed to simultaneously run several different combinations
    of regular expressions, flags, and encodings. As a consequence these attributes are
    provided by :class:`Match` objects, and not as attributes of this class.

    The pattern matching process checks each pattern/encoding combination in turn; in
    consequence although order is preserved for individual patterns/encodings
    there is a possibility of overlaps between matches from different patterns. Search will
    also report the first pattern/encoding to match, not necessarily the earliest
    position in the string from which a match could have been found (e.g. by a different encoding).

    All methods except ``match()`` have the same argument signature *(buffer [,start [,end [,endanchor]]] )*:

	**buffer**  A *bytes* object to search. Note that :class:`RegexObject` does not support string
	objects, unlike the module level functions, which encode strings if they are presented.

	**start**  The first byte in the buffer from which to search (default 0). It is assumed that
	the start will be on the minimum character word boundary of any encodings used (i.e. 2 bytes for
	utf16, 4 bytes for utf32).

	**end**  The index of the last byte to be searched + 1 (i.e. normal Python slice end). A
	regular expression will fail if it gets to this point and has not matched.

	**endAnchor**  The last byte to be used as a match anchor + 1 (i.e. the last position
	from which the pattern match check should begin). The ability to specify both the buffer
	end and the end anchor allows long data streams to be split into blocks while ensuring that
	all possible anchor points are searched without duplication and with a specified search
	window. See example in :ref:`ref_example_largeFile`.

	The :class:`RegexObject` provides the following methods:

		.. method:: search(buffer [,start [,end [,endanchor]]] )

		Search through the *buffer* (bytes) to find the first match of *pattern*,
		and return the corresponding :class:`Match` instance. Returns ``None`` if
		no match is found.

		.. method:: match(buffer [,start [,end]] )

		Attempt to match the *pattern* starting at the first character in the buffer. In other words, the
		method is the same as ``search()`` but only succeeds if there is a match at the start of
		the target buffer.

		.. method:: findall(buffer [,start [,end [,endanchor]]] )

		Returns all non-overlapping matches of *pattern* in the *buffer* as a list of strings,
		or a list of tuples. If the *pattern* has sub-match groups then each result will be a
		tuple in which the first value is the overall match and subsequent values are
		the groups. Groups that did not match are returned as ``None`` entries in the tuple.

		.. method:: finditer(buffer [,start [,end [,endanchor]]] )

		Returns an *iterator* of :class:`Match` instances over non-overlapping
		matches of *pattern* in the given *buffer*.
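
The interaction of *end* and *endAnchor* when a stream is processed in blocks can be sketched with the standard ``re`` module. ``re`` has no end-anchor argument, so the sketch filters on the match start instead. ``scan_blocks``, the block size and the window size are hypothetical choices, and the data is held in memory rather than read from a file.

```python
import re


def scan_blocks(pattern, data, block_size, window):
    """Find matches block by block without missing or repeating any anchor."""
    compiled = re.compile(pattern)
    found = []
    for start in range(0, len(data), block_size):
        end_anchor = min(start + block_size, len(data))   # last anchor + 1
        end = min(end_anchor + window, len(data))         # matching may run on to here
        for m in compiled.finditer(data, start, end):
            # Keep only matches anchored inside this block; later starts
            # belong to the next block and will be found there.
            if m.start() < end_anchor:
                found.append((m.start(), m.end()))
    return found


# The first keyword straddles the boundary between blocks 0 and 1.
data = b"x" * 14 + b"KEYWORD" + b"x" * 20 + b"KEYWORD"
spans = scan_blocks(rb"KEYWORD", data, block_size=16, window=8)
```

The window must be at least as long as the longest match of interest, otherwise a match that starts near the end of a block is cut off.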
   
.. _ref_api_match:
   
-------------
Match Objects
-------------

.. class::	Match

    A :class:`Match` instance reports a single successful match; a Match object always evaluates as true.

    The methods of this class provide the same signatures as the standard Python *re* module. However, 
    because *jsre* supports multiple expressions and encodings :class:`Match` objects also provide 
    attributes which allow the retrieval of the expression and encoding associated with a particular 
    match, and (if ``INDEXALT`` is set) the keyword component of the pattern which matched.

    The text matched by the regular expression is always returned as a string by decoding
    the byte buffer using whatever encoding was successful.

    Usually the indexes of a match group (start, end) are returned as byte indexes. However, if the match
    resulted from a module function which was presented with a string target, the indexes are corrected to
    reference positions in the original string.

    Methods that require a group index as an argument can instead be provided with a group name, if the
    group is named in the regular expression.

    Available class attributes:

	.. attribute:: pos, endpos, endAnchor

	The buffer start, buffer end and last anchor buffer positions specified for this match in the :class:`RegexObject` 
	method that resulted in the :class:`Match` object.	

	.. attribute:: lastindex

	The integer index of the last matched capturing group; note that this is the last group that 
	resulted in a match in the expression, not necessarily the highest numbered group (which may not have matched).
    
	.. attribute:: lastgroup   

	The name of the group corresponding to the *lastindex*, if it was named.
    
	.. attribute:: re

	The regular expression that resulted in this :class:`Match` object.

	.. attribute:: encoding

	The encoding that resulted in this :class:`Match` object.

	.. attribute:: flags

	The flags used to compile this regular expression.

	.. attribute:: buf

	The byte buffer that was matched.

	.. attribute:: keypattern

	If a pattern which is one of a set of alternatives within an expression was matched,
	and the ``INDEXALT`` flag was set, this is the pattern that matched. Note that this
	is the pattern as specified in the regular expression, not as found in the buffer. This is
	helpful in the common case where ``IGNORECASE`` was set, resulting in many variants of the original
	pattern being matched.

    The class :class:`Match` provides the following methods:

        .. method:: group([group1, ...])
        
        Returns a decoded string for one or more match groups. The byte buffer which has been searched will
        be decoded by the encoding that resulted in the hit, providing a string result. If one group is 
        specified the result is a single string, if several are provided the result is a tuple of strings. 
        Group 0 is always the whole match and is the default if no groups are specified; groups within 
        the regular expression are indexed starting from 1 and group names, if specified, may be used 
        instead of numbers.
        
        .. method:: groups([default])
        
        Returns a tuple of all the groups in a match, decoded to string using the encoding which matched.
        Groups that did not contribute to the match are returned as ``None``, or as the default value
        if this is provided as a parameter.
        
        .. method:: groupdict([default])
        
        Returns a dict which maps group names to matched groups decoded to strings. If a group is not matched
        then the dictionary maps to ``None`` or the default value if this is provided as a parameter.
        
        .. method:: start([group])
        .. method:: end([group])
        
        Returns the start or end of a given group, or of the whole match if a group is not specified. The
        value of -1 is used to signify that the specified group did not match. Note that matching an
        empty string is different from failing to match; if an empty string is matched ``m.start()`` and
        ``m.end()`` have the same value. Groups may be named or numbered, and if no group is provided group 0
        (the whole match) is the default value. The end value is the index of the last character + 1,
        i.e. ``[m.start():m.end()]`` is a normal Python slice.
        
        The normal values returned are byte indexes into a byte buffer. So that::
        
        	m.group() == m.buf[m.start():m.end()].decode(m.encoding)
        
        However, if the match is a result of providing a module function with a string target, then the 
        start and end values are corrected to index values in that string.
        
        .. method:: span([group])
        
        Returns the tuple ``(m.start(group), m.end(group))``.
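
The index rules above can be illustrated with the standard ``re`` module, which follows the same conventions for byte buffers (an assumption made to keep the example runnable without *jsre*). The ``encoding`` variable is supplied by hand because ``re`` match objects have no such attribute.

```python
import re

encoding = "utf_8"
text = "na\u00efve caf\u00e9"
buf = text.encode(encoding)

m = re.search(rb"(caf\xc3\xa9)|(tea)", buf)

# Indexes are byte offsets; slicing and decoding recovers the matched text.
whole = buf[m.start():m.end()].decode(encoding)

# A group that did not take part in the match reports -1 and None.
unmatched = (m.start(2), m.end(2), m.group(2))

# span(group) is shorthand for (start(group), end(group)).
span_ok = m.span(1) == (m.start(1), m.end(1))

# Byte offsets differ from string indexes when characters are multi-byte.
byte_start = m.start()
char_start = text.index("caf\u00e9")
```

The difference between ``byte_start`` and ``char_start`` is the reason indexes are corrected when a string target is given to a module function.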
          
