
PCRS(3)               DragonFly Library Functions Manual               PCRS(3)

NAME

pcrs - Perl-compatible regular substitution.

SYNOPSIS

#include <pcrs.h>

pcrs_job *pcrs_compile(const char *pattern, const char *substitute, const char *options, int *errptr);

pcrs_job *pcrs_compile_command(const char *command, int *errptr);

int pcrs_execute(pcrs_job *job, char *subject, int subject_length, char **result, int *result_length);

int pcrs_execute_list(pcrs_job *joblist, char *subject, int subject_length, char **result, int *result_length);

pcrs_job *pcrs_free_job(pcrs_job *job);

void pcrs_free_joblist(pcrs_job *joblist);

char *pcrs_strerror(int err);

DESCRIPTION

The PCRS library is a supplement to the PCRE(3) library that implements regular expression based substitution, as provided by perl(1)'s 's' operator. It uses the same syntax and semantics as Perl 5, with just a few differences (see below).

In a first step, the information on a substitution, i.e. the pattern, the substitute and the options, is compiled from Perl syntax to an internal form called pcrs_job by using either the pcrs_compile() or the pcrs_compile_command() function.

Once the job is compiled, it can be used on subjects, which are arbitrary memory areas containing string or binary data, by calling pcrs_execute(). Jobs can be chained to joblists, and whole joblists can be applied to a subject using pcrs_execute_list().

There are also convenience functions for freeing jobs and for converting error codes to strings, namely pcrs_free_job(), pcrs_free_joblist() and pcrs_strerror().

COMPILING JOBS

The function pcrs_compile() compiles a pcrs_job from a pattern, substitute and options string. The resulting pcrs_job structure is dynamically allocated, and it is the caller's responsibility to call pcrs_free_job() when it is no longer needed.

pcrs_compile_command() is a convenience wrapper function that parses a Perl command of the form s/pattern/substitute/[options] into its components and then calls pcrs_compile(). As in Perl, you are not bound to the '/' character: whatever follows the 's' is used as the delimiter. Patterns or substitutes that contain the delimiter need to quote it: s/th\/is/th\/at/ replaces th/is by th/at and can be written more simply as s|th/is|th/at|.

pattern, substitute, options and command must be zero-terminated C strings. substitute and options may be NULL, in which case they are treated like the empty string.

Return value and diagnostics

On success, both functions return a pointer to the compiled job. On failure, NULL is returned. In that case, the pcrs error code is written to *errptr.

Patterns

For the syntax of the pattern, see the PCRE(3) manual page.

Substitutes

The substitute uses Perl syntax as documented in the perlre(1) manual page, with some exceptions. Most notably and evidently, since PCRS is not Perl, variable interpolation and Perl command substitution won't work. The special variables that do get interpolated are:

$1, $2, ..., $n
    Like in Perl, these variables refer to what the nth capturing subpattern in the pattern matched.

$& and $0
    Refer to the whole match. Note that $0 is deprecated in recent Perl versions and now refers to the program name.

$+
    Refers to what the last capturing subpattern matched.

$` and $' (backtick and tick)
    Refer to the areas of the subject before and after the match, respectively. Note that, like in Perl, the unmodified subject is used, even if a global substitution previously matched.

Perl4-style references to subpattern matches of the form \1, \2, ..., which only exist in Perl5 for backwards compatibility, are not supported.

Also, since the substitute is a double-quoted string in Perl, you might expect all Perl syntax for special characters to apply. In fact, only the following are supported:

    \n    newline (0x0a)
    \r    carriage return (0x0d)
    \t    horizontal tab (0x09)
    \f    form feed (0x0c)
    \b    backspace (0x08)
    \a    alarm, bell (0x07)
    \e    escape (0x1b)
    \0    binary zero (0x00)

Options

The options gmisx are supported. e is not, since it would require a Perl interpreter, and neither is o, because the pattern is explicitly compiled anyway. Additionally, PCRS honors the options U and T. Where PCRE options are mentioned below, refer to PCRE(3) for the subtle differences to Perl behaviour.

g
    Replace all instances of pattern in subject, not just the first one.

i
    Match the pattern without respect to case. This translates to PCRE_CASELESS.

m
    Treat the subject as consisting of multiple lines, i.e. '^' matches immediately after, and '$' immediately before each newline. Translates to PCRE_MULTILINE.

s
    Treat the subject as consisting of one single line, i.e. let the scope of the '.' metacharacter include newlines. Translates to PCRE_DOTALL.

x
    Allow extended regular expression syntax in the pattern, enabling whitespace and comments in complex patterns. Translates to PCRE_EXTENDED.

U
    Switch the default behaviour of the '*' and '+' quantifiers to ungreedy. Note that appending a '?' switches back to greedy(!). The explicit in-pattern switches (?U) and (?-U) remain unaffected. Translates to PCRE_UNGREEDY.

T
    Consider the substitute trivial, i.e. do not interpret any references or special character escape sequences in the substitute. Handy for large user-supplied substitutes, which would otherwise have to be examined and properly quoted.

Unsupported options are silently ignored.
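The delimiter and quoting rules for commands can be illustrated with a small stand-alone sketch. split_command() is a hypothetical helper and not part of PCRS; it does not claim to mirror how pcrs_compile_command() is implemented. It only assumes the behaviour described above: the character after the 's' is the delimiter, and a backslash protects the character that follows it. Requiring the closing delimiter is an additional assumption of this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper (not part of PCRS): split a command of the form
 * s<d>pattern<d>substitute<d>options in place, where <d> is whatever
 * character follows the 's'. A backslash protects the next character,
 * so a quoted delimiter does not end a token. The backslashes stay in
 * the tokens; this sketch does not unquote them.
 * Returns 0 on success, -1 if the command is malformed.
 */
int split_command(char *cmd, char **pattern, char **substitute, char **options)
{
    char *parts[3];
    char delim;
    char *p;
    int n = 0;

    assert(cmd != NULL);
    if (cmd[0] != 's' || cmd[1] == '\0')
        return -1;

    delim = cmd[1];
    p = cmd + 2;
    parts[n++] = p;                        /* pattern starts after the delimiter */

    while (*p != '\0' && n < 3) {
        if (*p == '\\' && p[1] != '\0') {  /* quoted character: skip both bytes */
            p += 2;
        } else if (*p == delim) {          /* unquoted delimiter ends the token */
            *p++ = '\0';
            parts[n++] = p;
        } else {
            p++;
        }
    }
    if (n < 3)                             /* a closing delimiter is missing */
        return -1;

    *pattern = parts[0];
    *substitute = parts[1];
    *options = parts[2];                   /* may be the empty string */
    return 0;
}
```

Called on a writable copy of s|th/is|th/at|g, the helper yields the pattern th/is, the substitute th/at and the options g. The same command written with '/' as the delimiter needs the backslashes, and they remain part of the tokens.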

EXECUTING JOBS

Calling pcrs_execute() produces a modified copy of the subject, in which the first occurrence (or all occurrences, if the 'g' option was given when compiling the job) of the job's pattern in the subject is replaced by the job's substitute.

The first subject_length bytes following subject are processed, so a subject_length that exceeds the actual subject is dangerous. Note that for zero-terminated C strings, you should set subject_length to strlen(subject), so that the dollar metacharacter matches at the end of the string, not after the string-terminating null byte.

For convenience, an extra null byte is appended to the result so it can again be used as a string. The subject itself is left untouched, and *result is dynamically allocated, so it is the caller's responsibility to free() it when it is no longer needed. The result's length (excluding the extra null byte) is written to *result_length.

If the job matched, the PCRS_SUCCESS flag in job->flags is set.

Return value and diagnostics

On success, pcrs_execute() returns the number of substitutions that were made, which is limited to 0 or 1 for non-global searches. On failure, a negative error code is returned and *result is set to NULL.
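The ownership and length conventions described above can be tried out without the library. The following is a deliberately simplified stand-in: replace_first() is hypothetical, matches a literal byte string instead of a regular expression, and is not how PCRS works internally. It only follows the same contract: the subject is not modified, the result is allocated with one extra null byte, the reported length excludes that byte, and the return value is the number of substitutions.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical stand-in (not part of PCRS): replace the first occurrence
 * of the literal string `needle` within the first subject_length bytes
 * of `subject`. Returns the number of substitutions (0 or 1), or -1 if
 * memory is exhausted, in which case *result is NULL.
 */
int replace_first(const char *subject, size_t subject_length,
                  const char *needle, const char *substitute,
                  char **result, size_t *result_length)
{
    size_t nlen, slen, i, out_len;
    size_t at = subject_length;            /* subject_length means "no match" */

    assert(subject != NULL && needle != NULL && substitute != NULL);
    assert(result != NULL && result_length != NULL);

    nlen = strlen(needle);
    slen = strlen(substitute);

    if (nlen > 0) {
        for (i = 0; i + nlen <= subject_length; i++) {
            if (memcmp(subject + i, needle, nlen) == 0) {
                at = i;
                break;
            }
        }
    }

    out_len = (at < subject_length) ? subject_length - nlen + slen
                                    : subject_length;
    *result = malloc(out_len + 1);         /* one extra byte for the null */
    if (*result == NULL)
        return -1;

    if (at < subject_length) {
        memcpy(*result, subject, at);
        memcpy(*result + at, substitute, slen);
        memcpy(*result + at + slen, subject + at + nlen,
               subject_length - at - nlen);
    } else {
        memcpy(*result, subject, subject_length);  /* no match: exact copy */
    }
    (*result)[out_len] = '\0';
    *result_length = out_len;              /* excludes the extra null byte */
    return (at < subject_length) ? 1 : 0;
}
```

As with pcrs_execute(), the caller passes strlen(subject) for a C string and must free() the result, whether or not anything was replaced.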

FREEING JOBS

It is not sufficient to call free() on a pcrs_job, because it contains pointers to other dynamically allocated structures. Use pcrs_free_job() instead. It is safe to pass NULL pointers (or pointers to invalid pcrs_jobs that contain NULL pointers to dependent structures) to pcrs_free_job().

Return value

The value of the job's next pointer.

CHAINING JOBS

PCRS supports to some extent the chaining of multiple pcrs_job structures by means of their next member. Chaining the jobs is up to you, but once you have built a linked list of jobs, you can execute a whole joblist on a given subject by a single call to pcrs_execute_list(), which will sequentially traverse the linked list until it reaches a NULL pointer, and call pcrs_execute() for each job it encounters, feeding the result and result_length of each call into the next as the subject and subject_length. As in the single job case, the original subject remains untouched, but all interim results are of course free()d. The return value is the accumulated number of matches for all jobs in the joblist. Note that while this is handy, it reduces the diagnostic value of err, since you won't know which job failed. In analogy, you can free all jobs in a given joblist by calling pcrs_free_joblist().
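Since chaining is left to the caller, a sketch of the list handling may help. The code below uses a hypothetical stand-in type, struct demo_job, because only the next member matters for the illustration; with the real library the same pattern would apply to pcrs_job and its next member. demo_free_job() returns the successor, as described for pcrs_free_job(). Unlike pcrs_free_joblist(), which returns void, demo_free_joblist() returns a count so the sketch can be checked.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for pcrs_job: only the next member matters here. */
struct demo_job {
    int id;                     /* placeholder for the compiled data */
    struct demo_job *next;      /* next job in the list, NULL at the end */
};

/* Allocate a job and append it to the list; returns the new job or NULL. */
struct demo_job *demo_append(struct demo_job **head, int id)
{
    struct demo_job *job;
    struct demo_job **tail = head;

    assert(head != NULL);
    job = malloc(sizeof *job);
    if (job == NULL)
        return NULL;
    job->id = id;
    job->next = NULL;

    while (*tail != NULL)       /* walk to the terminating NULL pointer */
        tail = &(*tail)->next;
    *tail = job;
    return job;
}

/* Free one job and hand back its successor; NULL is accepted. */
struct demo_job *demo_free_job(struct demo_job *job)
{
    struct demo_job *next;

    if (job == NULL)
        return NULL;
    next = job->next;
    free(job);
    return next;
}

/* Free a whole list by following the returned next pointers. */
int demo_free_joblist(struct demo_job *joblist)
{
    int freed = 0;

    while (joblist != NULL) {
        joblist = demo_free_job(joblist);
        freed++;
    }
    return freed;
}
```

Returning the next pointer from the single-job free function is what keeps the list loop to one line: the pointer to the successor is read before the job is released.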

QUOTING

The quote character is (surprise!) '\'. It quotes the delimiter in a command, the '$' in a substitute, and, of course, itself. Note that the '$' doesn't need to be quoted if it isn't followed by [0-9+'`&]. For quoting in the pattern, please refer to PCRE(3).
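When a substitute comes from user input and the T option is not wanted, the quoting rule above can be applied mechanically. quote_substitute() is a hypothetical helper, not part of PCRS. It quotes every '$' and every backslash, which is more than strictly required (a '$' only needs quoting before certain characters) but is assumed to be harmless given that the backslash quotes both.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper (not part of PCRS): return a malloc()ed copy of
 * `text` in which each '$' and each backslash is preceded by a backslash,
 * so that neither references nor escape sequences are recognized in it.
 * Returns NULL if memory is exhausted. The caller frees the result.
 */
char *quote_substitute(const char *text)
{
    size_t len, extra = 0, i, j = 0;
    char *out;

    assert(text != NULL);
    len = strlen(text);

    for (i = 0; i < len; i++)              /* count characters to quote */
        if (text[i] == '$' || text[i] == '\\')
            extra++;

    out = malloc(len + extra + 1);
    if (out == NULL)
        return NULL;

    for (i = 0; i < len; i++) {
        if (text[i] == '$' || text[i] == '\\')
            out[j++] = '\\';               /* insert the quote character */
        out[j++] = text[i];
    }
    out[j] = '\0';
    return out;
}
```

The result is meant to be passed as the substitute argument of pcrs_compile(). If it is embedded in a command for pcrs_compile_command() instead, the delimiter would need quoting as well, which this sketch does not do.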

DIAGNOSTICS

When compiling a job via either the pcrs_compile() or the pcrs_compile_command() function, you know that something went wrong when you are returned a NULL pointer. In that case, or in the event of non-fatal warnings, the integer pointed to by errptr contains a nonzero error code, which is either a passed-through PCRE error code or one generated by PCRS. Under normal circumstances, it can take the following values:

PCRE_ERROR_NOMEMORY
    While compiling the pattern, PCRE ran out of memory.

PCRS_ERR_NOMEM
    While compiling the job, PCRS ran out of memory.

PCRS_ERR_CMDSYNTAX
    pcrs_compile_command() didn't find four tokens while parsing the command.

PCRS_ERR_STUDY
    A PCRE error occurred while studying the compiled pattern. Since pcre_study() only provides textual diagnostic information, the details are lost.

PCRS_WARN_BADREF
    The substitute contains a reference to a capturing subpattern that has a higher index than the number of capturing subpatterns in the pattern or that exceeds the current hard limit of 33 (see LIMITATIONS below). As in Perl, this is non-fatal and results in substitutions with the empty string.

When executing jobs via pcrs_execute() or pcrs_execute_list(), a negative return code indicates an error. In that case, *result is NULL. Possible error codes are:

PCRE_ERROR_NOMEMORY
    While matching the pattern, PCRE ran out of memory. This can only happen if there are more than 33 backreferences in the pattern(!) and memory is too tight to extend storage for more.

PCRS_ERR_NOMEM
    While executing the job, PCRS ran out of memory.

PCRS_ERR_BADJOB
    The pcrs_job* passed to pcrs_execute() was NULL, or the job is bogus (it contains NULL pointers to the compiled pattern, extra, or substitute).

If you see any other PCRE error code passed through, you've either messed with the compiled job or found a bug in PCRS. Please send me an email.

Ah, and don't look for PCRE_ERROR_NOMATCH, since this is not an error in the context of PCRS. Should there be no match, an exact copy of the subject is found at *result and the return code is 0 (matches).

All error codes can be translated into human readable text by means of the pcrs_strerror() function.

EXAMPLE

A trivial command-line test program for PCRS might look like:

    #include <pcrs.h>
    #include <stdio.h>
    #include <stdlib.h>   /* free() */
    #include <string.h>   /* strlen() */

    int main(int Argc, char **Argv)
    {
        pcrs_job *job;
        char *result;
        size_t newsize;
        int err;

        if (Argc != 3)
        {
            fprintf(stderr, "Usage: %s s/pattern/substitute/[options] subject\n", Argv[0]);
            return 1;
        }

        /* Compile the command; give up if that fails. */
        if (NULL == (job = pcrs_compile_command(Argv[1], &err)))
        {
            fprintf(stderr, "%s: compile error: %s (%d).\n", Argv[0], pcrs_strerror(err), err);
            return 1;
        }

        /* Apply the job; a negative return value is an error code. */
        if (0 > (err = pcrs_execute(job, Argv[2], strlen(Argv[2]), &result, &newsize)))
        {
            fprintf(stderr, "%s: exec error: %s (%d).\n", Argv[0], pcrs_strerror(err), err);
        }
        else
        {
            printf("Result: *%s*\n", result);
            free(result);
        }

        pcrs_free_job(job);
        return(err < 0);
    }

LIMITATIONS

The number of matches that a global job can have is only limited by the available memory. An initial storage for 40 matches is reserved, which is dynamically resized by the factor 1.6 whenever it is exhausted.

The number of capturing subpatterns is currently limited to 33, which is a Bad Thing[tm]. It should be dynamically expanded until it reaches the PCRE limit of 99. This limitation is particularly embarrassing since PCRE 3.5 has raised the capturing subpattern limit to 65K.

All of the above values can be adjusted in the "Capacity" section of pcrs.h.

The Perl-style escape sequences for special characters \nnn, \xnn, and \cX are currently unsupported.
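To get a feeling for how often the match storage grows, the resize rule can be replayed in a few lines. resizes_needed() is a hypothetical helper, not part of PCRS. It assumes that the new size is truncated to an integer after each multiplication, which the text above does not specify, and it writes the factor 1.6 as 8/5 to stay in integer arithmetic.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper (not part of PCRS): number of resizes needed before
 * storage that starts at 40 entries and grows by the factor 1.6 can hold
 * `matches` matches. Assumption: sizes are truncated to whole entries.
 * Not meant for values of `matches` near the maximum of size_t, where
 * the multiplication would overflow.
 */
unsigned resizes_needed(size_t matches)
{
    size_t capacity = 40;                  /* initial storage */
    unsigned resizes = 0;

    while (capacity < matches) {
        capacity = capacity * 8 / 5;       /* factor 1.6, truncated */
        resizes++;
    }
    return resizes;
}
```

Under these assumptions the capacity sequence starts 40, 64, 102, 163, so geometric growth keeps the number of reallocations small even for subjects with many matches.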

BUGS

This library has only been tested in the context of one application and should be considered high risk.

HISTORY

PCRS was originally written for the Privoxy project (http://www.privoxy.org/).

SEE ALSO

PCRE(3), perl(1), perlre(1)

AUTHOR

PCRS is Copyright 2000 - 2003 by Andreas Oesterhelt <andreas@oesterhelt.org> and is licensed under the terms of the GNU Lesser General Public License (LGPL), version 2.1, which should be included in this distribution, with the exception that the permission to replace that license with the GNU General Public License (GPL) given in section 3 is restricted to version 2 of the GPL.

If it is missing from this distribution, the LGPL can be obtained from http://www.gnu.org/licenses/lgpl.html or by mail: Write to the Free Software Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.

pcrs-0.0.3                     2 December 2003                        PCRS(3)
