1. jsre API
The module provides functions that give simplified access to the objects described below. These functions also accept string search targets, in addition to the bytes supported by normal jsre objects. Strings are handled by encoding them as utf-32-be before searching.
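The utf-32-be step can be pictured with the standard library alone. The sketch below does not call jsre; it only shows why a fixed 4-byte encoding makes it easy to map a byte offset back to a string index. The sample text and variable names are illustrative assumptions.

```python
# Standard library only; this does not call jsre.
text = "naïve café"
needle = "café"

buf = text.encode("utf-32-be")       # every code point becomes exactly 4 bytes
target = needle.encode("utf-32-be")

# Step through the buffer 4 bytes at a time so only character-aligned hits count.
byte_index = next(
    i for i in range(0, len(buf) - len(target) + 1, 4)
    if buf[i:i + len(target)] == target
)
char_index = byte_index // 4         # byte offset -> string index

print(byte_index, char_index)
```

Because every character occupies four bytes, the integer division recovers the same index that a search on the original string would give.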
The module uses standard Python exceptions, notably ‘SyntaxError’ for errors in the pattern,
1.2. Regular Expression Objects
class jsre.ReCompiler(pattern=None, encoding=('utf_8', ), flags=0, offset=0, stride=0, utf32_bytecorrection=False)
Manages the compilation of multiple REs and encodings.
The flags, encodings and patterns specified for a single object of this class are compiled into a single VM and searched simultaneously. The compiler builds a RegexObject, which wraps the virtual regex machine used to run pattern matching.
Several patterns may be compiled into the same execution object at the same time, and each may have different flags, encodings, etc.
Keyword Arguments: pattern – a str regular expression.
encoding – a list of encodings to search; the default is utf_8.
flags –
I or IGNORECASE: Character matching is case insensitive. Basic Unicode folding is implemented (i.e. equivalent characters are not just upper and lower case).
M or MULTILINE: ^ and $ match the start/end of each line as well as the buffer start/end.
S or DOTALL: dot ‘.’ matches line ends as well as all other characters.
VERBOSE: Ignore white space, including line breaks, and comments within the re, outside character classes. A comment may be placed between ‘#’ and a line end.
INDEXALT: Index top-level alternatives, e.g. index the keyword hit in a list of alternative words. This is more scalable than using groups to identify matched alternatives. Only keywords are indexed, not other alternative expressions.
ASYNCHRONOUS: Test for a match at every byte, not only on the normal character boundaries.
SECTOR: Specify the stride and offset of match anchors, e.g. stride = 512, offset = 0 will test at the start of every sector of a disk image. (Otherwise stride and offset are not needed.)
stride – the stride value for moving anchors.
offset – a start offset into the buffer to be searched.
utf32_bytecorrection – divides match locations by 4 to give string indexes for 4-byte Unicode encodings.
Raises: SyntaxError – if the pattern is invalid.
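The effect of a stride and offset on match anchors can be sketched with the standard re module. This is an analogy only, not jsre's implementation: the helper anchored_hits and the synthetic disk image are hypothetical, and the pattern is tried only at offset, offset + stride, offset + 2 * stride, and so on.

```python
import re

def anchored_hits(pattern, buf, stride, offset=0):
    """Try the pattern only at offset, offset + stride, offset + 2*stride, ..."""
    rx = re.compile(pattern)
    hits = []
    for anchor in range(offset, len(buf), stride):
        m = rx.match(buf, anchor)    # match() starts the attempt exactly at `anchor`
        if m:
            hits.append((anchor, m.group()))
    return hits

SECTOR_SIZE = 512
image = bytearray(4 * SECTOR_SIZE)   # synthetic "disk image" of four empty sectors
image[0:4] = b"%PDF"                 # start of sector 0: on an anchor
image[700:704] = b"%PDF"             # middle of sector 1: not on an anchor
image[1024:1028] = b"%PDF"           # start of sector 2: on an anchor

hits = anchored_hits(rb"%PDF", bytes(image), stride=SECTOR_SIZE)
print(hits)
```

Only the signatures that sit on a sector boundary are reported; the one at byte 700 is skipped because no anchor falls there.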
compile()
Builds a regular expression object which is used to execute the pattern matching.
Returns: a regular expression object that supports search, findall and finditer. Return type: RegexObject
setPattern(pattern)
Update the search specification to include the specified pattern.
This pattern is compiled using any previously specified options or flags. The pattern specified will be searched in addition to any previously set patterns.
Parameters: pattern – a regular expression, in str format.
update(encoding=None, flags=None, offset=0, stride=0)
Update compiler encoding list, flags and options.
Parameters specified using update will override any already set; they will apply to any subsequently specified patterns. See ReCompiler for argument details.
Keyword Arguments: - encoding – a list of encodings to be used.
- flags – combination of required flags (I, M, S, etc).
- offset – start offset into buffer to be searched.
- stride – stride value for anchor moving (0 is default move for the encoding).
class jsre.RegexObject(tvm, reMap, altMap, utf32_bytecorrection)
Provides the pattern matching interface together with the compiled matching engine.
This pattern matcher is designed to simultaneously run several different combinations of regular expressions, flags, and encodings. As a consequence these attributes and others such as groups are provided in match results, and not from the execution object (unlike re and some other systems.)
The pattern matching process checks each pattern/encoding combination in turn; in consequence although order is preserved for individual patterns/encodings there is a possibility of overlaps between matches from different patterns. Search will also report from the first pattern/encoding to match.
All methods have the same argument signature: (buffer [,start [,end [,endanchor]]] )
Parameters: - buffer – a byte object to search.
- start – the first byte in the buffer from which to search (default 0). It is assumed that the start will be on the minimum character word boundary of any encodings used (ie 2 byte for utf16, 4 byte for utf32).
- end – index of the last byte to be searched + 1 (i.e. normal python slice end). A regular expression will fail if it gets to this point and has not matched.
- endAnchor – The last byte to be used as an re anchor (ie the last position from which the pattern match check should begin).
The ability to specify both the buffer end and the end anchor allows long data streams to be split into blocks while ensuring that every possible anchor point is searched exactly once. The search window is the overlap between blocks: the region beyond the end anchor that a match may still read.
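The block-splitting idea can be sketched with the standard re module. This is an analogy under stated assumptions, not jsre code: search_in_blocks, block_size and window are hypothetical names, and the anchor limit is exclusive here, whereas endAnchor above is described as the last anchor position itself.

```python
import re

def search_in_blocks(pattern, data, block_size, window):
    """Scan `data` block by block and return the (start, end) span of every match.

    Anchors for a block lie in [block_start, block_start + block_size), while the
    matcher may read up to `window` further bytes so that a match beginning near
    the end of a block can still complete.
    """
    rx = re.compile(pattern)
    hits = []
    for block_start in range(0, len(data), block_size):
        anchor_stop = min(block_start + block_size, len(data))  # role of the end anchor
        buffer_end = min(anchor_stop + window, len(data))       # role of `end`
        for anchor in range(block_start, anchor_stop):
            m = rx.match(data, anchor, buffer_end)
            if m:
                hits.append(m.span())
    return hits

data = b"xxxxxxHELLOxxxxHELLOxx"
hits = search_in_blocks(rb"HELLO", data, block_size=8, window=5)
print(hits)
```

The first HELLO starts at byte 6 and crosses the boundary at byte 8. It is still found, and found only once, because its anchor belongs to the first block and the window lets the match run into the second.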
findall(*spec)
Returns all matches as a list or, if the patterns contain multiple groups, as a list of tuples.
The results are returned as strings (str); the successful encoding is used to decode the matched bytes and provide a string version of the result.
Arguments: (buffer [,start [,end [,endanchor]]] ) - see RegexObject() for documentation.
Returns: List of strings or list of tuples containing strings, see Match.groups().
finditer(*spec)
Returns an iterator over all matches in the given buffer.
Arguments: (buffer [,start [,end [,endanchor]]] ) - see RegexObject() for documentation.
Returns: An iterator over Match objects.
search(*spec)
Returns the first pattern match in the given buffer.
Arguments: (buffer [,start [,end [,endanchor]]] ) - see RegexObject() for documentation.
Returns: A Match object, or None.
1.3. Match Objects
class jsre.Match(res, searchSpec, reSpec, altWord, utf32_bytecorrection)
Describes a single successful match.
A Match object always evaluates as true.
As far as possible this class provides the same interface as the standard Python re Match together with some extensions which allow the retrieval of the expression and encoding associated with a particular match, and (if INDEXALT) the keyword component of the pattern which matched.
The characters matched by the regular expression are always returned as a string by decoding the byte search target using whatever encoding was successful.
Usually the (start, stop) indexes of a match group are returned as byte indexes. However if utf32_bytecorrection was set as a ReCompiler option the indexes are divided by 4 to allow them to be regarded as string indexes, provided the pattern used 32 bit Unicode encoding.
Methods that require a group index as an argument can instead be provided with a group name, if the group is named in the regular expression.
pos
The start position specified for this match (see RegexObject).
endpos
The end of buffer specified for this match.
endAnchor
The last anchor position allowed for this match.
lastindex
The integer index of the last matched capturing group.
lastgroup
The name of the last matched capturing group.
re
The regular expression used for this match.
encoding
The encoding used for this match.
buf
The bytes buffer used for this match.
end(*args)
Returns the end of the group, or of the whole match; see start().
getAltWord()
Returns the keyword which was successful in this match.
This is intended to allow the efficient indexing of results obtained by matching large numbers of keywords.
The feature is enabled by the INDEXALT flag, and returns None if not enabled or there is no indexed match.
Returns: A string which is the keyword (alternative regular expression sub-pattern) provided by the user. This may not be the same as the string which matched if IGNORECASE is specified.
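The difference between the matched text and the keyword supplied by the user can be illustrated with the standard re module. The sketch below is not how jsre indexes alternatives. It uses a hypothetical case-folded lookup table purely to show what kind of value is returned when IGNORECASE is in effect.

```python
import re

keywords = ["Invoice", "Passport", "IBAN"]

# Map a case-folded form of each keyword back to the spelling the user supplied.
by_folded = {k.casefold(): k for k in keywords}
rx = re.compile("|".join(re.escape(k) for k in keywords), re.IGNORECASE)

text = "scan of PASSPORT and invoice"
# Pair each matched string with the user-supplied keyword that produced it.
found = [(m.group(), by_folded[m.group().casefold()]) for m in rx.finditer(text)]
print(found)
```

Each pair holds the text as it appeared in the target and the keyword as the user wrote it, which is the distinction the Returns note above draws.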
group(*args)
Returns a decoded substring for one or more match subgroups.
Parameters: group – One or more group indexes or names. Group 0 is always the whole match and group names may be used instead of indexes.
Returns: If there is a single argument a string, otherwise a tuple of strings.
groupdict(*args)
Returns a dictionary of named group matches.
Parameters: default – specifies the value for groups that did not participate in the match; if not provided, None is the default.
Returns: A dictionary which maps group name to group string matched.
groups(*args)
Returns a tuple of the subgroups of the match.
Keyword Arguments: default – specifies the value for groups that did not participate in the match; if not specified, None is the default.
Returns: A tuple of decoded strings, with groups that failed to match replaced by the default value.
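Since this class is described as following the standard Python re Match interface as far as possible, the default-value behaviour can be previewed with re itself. The sketch below uses only the standard library. Whether jsre behaves identically in every detail is an assumption, not something this section confirms.

```python
import re

# The optional "value" group does not participate when no digits follow "=".
m = re.search(r"(?P<key>\w+)=(?P<value>\d+)?", "timeout=")

plain = m.groups()                       # non-participating groups become None
filled = m.groups(default="")            # ... or the supplied default
named = m.groupdict(default="n/a")       # same idea, keyed by group name

print(plain, filled, named, m.lastindex, m.lastgroup)
```

Here lastindex and lastgroup refer to the key group, because it is the last capturing group that actually matched.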
span(*args)
Returns the start and end of the match as a tuple. See start().
start(*args)
Returns the index of the start of the group, or of the whole match.
start and end together allow retrieval of the bytes in the usual way: match.buf[start(i):end(i)]
Parameters: group – The ID or name of the group. If not specified, group 0 (the whole match) is used.
Returns: An integer index to bytes in the search target, or to the utf32 string position (see Match() documentation).
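The slice-then-decode pattern can be tried with the standard re module on a bytes buffer. This sketch does not use jsre; the sample buffer and encoding are illustrative assumptions. It shows that the indexes address bytes, so they can differ from positions in the decoded string.

```python
import re

encoding = "utf_8"
source = "café=ouvert"
buf = source.encode(encoding)        # "é" takes two bytes in UTF-8

m = re.search(rb"=(\w+)", buf)
start, end = m.start(1), m.end(1)    # byte indexes of group 1
raw = buf[start:end]                 # the matched bytes, sliced in the usual way
text = raw.decode(encoding)          # decoded with the encoding that was searched

print(start, end, text, source.index(text))
```

The group starts at byte 6 but at character 5 of the original string, which is why a correction such as utf32_bytecorrection is needed before byte indexes can be used as string indexes.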