|
http://www.codeproject.com/database/xp_pcre.asp
Related file download: http://www.chinalabs.com/resource/soft/dev/xp_pcre.zip
------------------------------------------------------------------------------------

Introduction

xp_pcre is a follow-up to my Extended Stored Procedure xp_regex. Both allow you to use regular expressions in T-SQL on Microsoft SQL Server 2000. This version was written because xp_regex uses the .NET Framework, and many people are not able to install .NET on their SQL Servers.

xp_pcre is so named because it uses the "Perl Compatible Regular Expressions" library, available at http://www.pcre.org/. I've also used PCRE++, a set of C++ classes that make PCRE easier to use. PCRE++ is available at http://www.daemon.de/en/software/pcre/. These links are provided only for reference; you don't need to download anything from them in order to use xp_pcre. I've included everything in the ZIP file.

Update: This time I actually have included everything. Previously, the ZIP file had a copy of xp_regex.dll instead of xp_pcre.dll. My mistake.

Overview

There are four Extended Stored Procedures in the DLL:

- xp_pcre_match
- xp_pcre_replace
- xp_pcre_split
- xp_pcre_show_cache

The parameters of all of these procedures can be CHAR, VARCHAR or TEXT of any SQL Server-supported length. The only exception is the @column_number parameter of xp_pcre_split, which is an INT.

If any required parameter is NULL, no matching will be performed and the @result parameter will be left unchanged.

1. xp_pcre_match

Syntax:

    EXEC xp_pcre_match @input, @regex, @result OUTPUT

- @input is the text to check.
- @regex is the regular expression to match.
- @result is an output parameter that will hold either '0' or '1'.

xp_pcre_match checks whether the input matches the regular expression. If so, @result is set to 1. If not, @result is set to 0. If either @input or @regex is NULL, @result is left unchanged.

2. xp_pcre_replace

Syntax:

    EXEC xp_pcre_replace @input, @regex, @replacement, @result OUTPUT

- @input is the text to parse.
- @regex is the regular expression to match.
- @replacement is what each match will be replaced with.
- @result is an output parameter that will hold the result.

xp_pcre_replace is a search-and-replace function. All matches are replaced with the contents of the @replacement parameter. For those who have used xp_regex, this function can be used in place of both xp_regex_format and xp_regex_replace.

For example, this is how you would remove all whitespace from an input string:

    DECLARE @out VARCHAR(8000)
    EXEC xp_pcre_replace 'one two three four ', '\s+', '', @out OUTPUT
    PRINT '[' + @out + ']'

prints out:

    [onetwothreefour]

To replace all numbers (regardless of length) with "###":

    DECLARE @out VARCHAR(8000)
    EXEC xp_pcre_replace '12345 is less than 99999, but not 1, 12, or 123', '\d+', '###', @out OUTPUT
    PRINT @out

prints out:

    ### is less than ###, but not ###, ###, or ###

The next example shows how to achieve behavior similar to xp_regex_format, Regex.Result() in .NET, or string interpolation in Perl (i.e. $formatted_phone_number = "($1) $2-$3").

The regex ^.*?(\d{3})[^\d]*(\d{3})[^\d]*(\d{4}).*$ will parse just about any phone-number-like string you throw at it. For instance, this code:

    DECLARE @out VARCHAR(50)
    EXEC xp_pcre_replace '(310)555-1212', '^.*?(\d{3})[^\d]*(\d{3})[^\d]*(\d{4}).*$', '($1) $2-$3', @out OUTPUT
    PRINT @out
    EXEC xp_pcre_replace '310.555.1212', '^.*?(\d{3})[^\d]*(\d{3})[^\d]*(\d{4}).*$', '($1) $2-$3', @out OUTPUT
    PRINT @out
    EXEC xp_pcre_replace ' 310!555 hey! 1212 hey!', '^.*?(\d{3})[^\d]*(\d{3})[^\d]*(\d{4}).*$', '($1) $2-$3', @out OUTPUT
    PRINT @out
    EXEC xp_pcre_replace ' hello, ( 310 ) 555.1212 is my phone number. Thank you.', '^.*?(\d{3})[^\d]*(\d{3})[^\d]*(\d{4}).*$', '($1) $2-$3', @out OUTPUT
    PRINT @out

prints out:

    (310) 555-1212
    (310) 555-1212
    (310) 555-1212
    (310) 555-1212

Those of you who have used xp_regex_format will notice a slight difference in the regular expressions: they all start with ^.*? and end with .*$. The reason is that xp_pcre_replace replaces only what matches, so the regex has to match the entire string; otherwise any text before or after the phone number would be left in the result. The ^.*? and .*$ match, respectively, the beginning and end of the input string (along with any extra characters before and after).

3. xp_pcre_split

Syntax:

    EXEC xp_pcre_split @input, @regex, @column_number, @result OUTPUT

- @input is the text to parse.
- @regex is a regular expression that matches the delimiter.
- @column_number indicates which column to return.
- @result is an output parameter that will hold the formatted results.

Column numbers start at 1. An error will be raised if @column_number is less than 1. If @column_number is greater than the number of columns that result from the split, the value of @result will be unchanged.

This function splits text data on some sort of delimiter (comma, pipe, whatever). The cool thing about a split using regular expressions is that the delimiter does not have to be as consistent as you would normally expect.

For example, take this line as your source data:

    one ,two|three : four

In this case, the delimiter is a comma, pipe or colon with any number of spaces before or after it (or both). In regex form, that is written: \s*[,|:]\s*.

For example:

    DECLARE @out VARCHAR(8000)
    EXEC xp_pcre_split 'one ,two|three : four', '\s*[,|:]\s*', 1, @out OUTPUT
    PRINT @out
    EXEC xp_pcre_split 'one ,two|three : four', '\s*[,|:]\s*', 2, @out OUTPUT
    PRINT @out
    EXEC xp_pcre_split 'one ,two|three : four', '\s*[,|:]\s*', 3, @out OUTPUT
    PRINT @out
    EXEC xp_pcre_split 'one ,two|three : four', '\s*[,|:]\s*', 4, @out OUTPUT
    PRINT @out

prints out:

    one
    two
    three
    four

4. xp_pcre_show_cache

Syntax:

    EXEC xp_pcre_show_cache

This procedure returns a result set containing all of the regular expressions currently in the cache. There's really no need to use it in the course of normal operations, but I found it useful during development.

5. fn_pcre_match, fn_pcre_split and fn_pcre_replace

These are user-defined functions that wrap the stored procedures. This way you can use the function as part of a SELECT list, a WHERE clause, or anywhere else you can use an expression (like CHECK constraints!). To me, using the UDFs is a much more natural way to use this library.

    USE pubs
    GO
    SELECT dbo.fn_pcre_replace( phone, '^.*?(\d{3})[^\d]*(\d{3})[^\d]*(\d{4}).*$', '($1) $2-$3' ) as formatted_phone
    FROM authors

This would format every phone number in the "authors" table.

Please note that you'll need to create the UDFs in every database in which you use them. The above example will probably fail unless you have created the UDFs in the Pubs database.

6. Installation

Copy xp_pcre.dll and pcre-0.dll into your \Program Files\Microsoft SQL Server\MSSQL\binn folder.

Run the SQL script INSTALL.SQL. This will register the procedures and create the user-defined functions in the Master database. See the section "User-defined function installation" in INSTALL.SQL if you want to use the UDFs from databases other than Master.

7. Important safety tip

You can end up sending PCRE into an infinite loop if you're not careful with your regular expressions. During testing, I attempted to run the following query:

    DECLARE @out VARCHAR(8000)
    EXEC xp_pcre_replace 'one two three four ', '\s*', '', @out OUTPUT
    PRINT '[' + @out + ']'

The only difference between this query and the (correct) one above is the change from '\s+' to '\s*'. This sent PCRE into an infinite loop because it was searching for and replacing every occurrence of zero or more spaces.
Since the beginning of the string does indeed match (it has zero spaces), it would replace those zero spaces and continue where it left off. Since it left off at position zero, that's where it picked up, and again it matched the "zero spaces" before the first character. Because the match is zero-width, it always picks up its next match at position 0 (again, right before the first character), and it will keep doing this until you stop the SQL Server service.

Since what we really wanted to do was to replace any instance of one or more spaces, we needed to change the * to a +.

8. Unicode support

Unfortunately, this version does not support Unicode arguments. Potential solutions include:

- Use xp_regex. Internally, the .NET Framework is 100% Unicode.
- Use the Boost Regex++ library. Unfortunately, this means giving up a lot of the newer regular expression functionality (zero-width assertions, cloistered pattern modifiers, etc.).
- Have xp_pcre convert to UTF-8, which is supported by PCRE. This is probably the most workable solution for those who can't use the .NET version, but since I don't use Unicode data in SQL Server, I haven't implemented it. We'll leave this as the dreaded "exercise for the reader." :)
- Use CAST, CONVERT or implicit conversions in the UDFs to coerce the arguments to ASCII. This is what will happen by default. But unless you're storing plain old ASCII in Unicode columns, this probably won't work for you.

9. Misc

To build the code, you'll need to have the Boost libraries installed. You can download them from http://www.boost.org/. Then just change the "Additional Include Directories" entry under the project properties in VS.NET. It's under Configuration Properties | C/C++ | General.

Comments/corrections/additions are welcome. Thanks!

10. History

6 Oct 03 - Updated ZIP to include xp_pcre.dll. Mentioned the Boost requirement in the Misc section. Cleaned up the documentation a bit.
10 Aug 03 - Initial release

Dan Farino
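
A note on the zero-width match problem from section 7: the usual way a replace-all loop avoids getting stuck is to step forward one character whenever a match is empty. The following is only a minimal sketch of that idea. It is written in Python with the standard `re` module standing in for PCRE, and `replace_all` is a hypothetical name chosen for illustration; it is not the code used by xp_pcre or PCRE++.

```python
import re


def replace_all(text, pattern, replacement):
    """Replace every match of pattern in text, guarding against empty matches.

    Hypothetical helper for illustration only; Python's re module stands in
    for PCRE here.
    """
    regex = re.compile(pattern)
    pieces = []
    pos = 0
    while pos <= len(text):
        match = regex.search(text, pos)
        if match is None:
            break
        # Keep the unmatched text before this match, then emit the replacement.
        pieces.append(text[pos:match.start()])
        pieces.append(replacement)
        if match.end() == match.start():
            # Zero-width match: searching again from the same position would
            # find the same empty match forever, so copy one character through
            # and step past it.
            if match.start() < len(text):
                pieces.append(text[match.start()])
            pos = match.start() + 1
        else:
            pos = match.end()
    # Whatever is left after the last match is copied unchanged.
    pieces.append(text[pos:])
    return "".join(pieces)


# The problematic pattern from section 7 now terminates.
print("[" + replace_all("one two three four ", r"\s*", "") + "]")  # prints [onetwothreefour]
```

With this guard, '\s*' and '\s+' give the same result for the whitespace-removal example. Without the `pos = match.start() + 1` step, the loop would keep finding the empty match at position 0, which is the behavior described in section 7.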
|