PHP Chinese character replacement and pattern matching problems: a must-read!
Jun 21, 2016 09:15 AM | Tags: Chinese characters | problems | Chinese
Author: bluedoor
Original post: http://www.anbbs.com/anbbs/index.php?f_id=3&page=1
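For the past two days I have been writing a keyword-highlighting program. It ran fine in local testing, but as soon as it went up on the server the page filled with garbled characters. Never mind highlighting; the page was simply unreadable!
I went looking for the bug and found that English text was fine, Chinese characters were prone to problems, and in some cases Chinese characters always caused problems.
To summarize:
When using pattern matching, e.g. preg_match_all($pat, ……) and preg_replace($pat, ……) ……
A case that is prone to problems: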
preg_match_all("/(漢字)+/ism","我是漢字,看你把我怎么著!",$m_a);
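This pattern is simple: it matches "漢字". With the Chinese characters used as a literal sequence in the pattern, the match succeeds. But don't celebrate too early: the result is not reliable, and the reason becomes clear further down.
A case that always goes wrong: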
preg_match_all("/[漢字]+/ism","我是漢字,看你把我怎么著!",$m_a);
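The intent was to match "漢", "字", or "漢字". This always goes wrong: the matches come back as a pile of garbage, and you might even get an infinite loop. Why? PHP strings are byte sequences rather than Unicode, and these functions are not multibyte-aware by default, so the two characters "漢字" (two bytes each in GB2312/GBK) are treated as 4 separate single-byte characters during matching. Inside a character class each of those bytes matches on its own, so halves of characters get matched. No wonder it breaks!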
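To see the byte-wise behaviour concretely, here is a minimal sketch. It assumes the script file is saved as UTF-8 (three bytes per character, rather than the two-byte GB encoding of the original post), and the variable names are illustrative only. It also shows PCRE's u modifier, which makes the preg_* functions treat pattern and subject as UTF-8; that only helps when the data really is UTF-8.

```php
<?php
// Assumption: this file is saved as UTF-8, so each Chinese character is three bytes.
$subject = "我是漢字";

// Without the u modifier, [漢字] is a class of six single bytes, not two characters.
// The lead byte 0xE6 of 我 and 是 is also the lead byte of 漢, so it matches on its own.
preg_match_all('/[漢字]+/', $subject, $byteWise);

// With the u modifier, PCRE reads pattern and subject as UTF-8 characters.
preg_match_all('/[漢字]+/u', $subject, $charWise);

foreach ($byteWise[0] as $match) {
    echo bin2hex($match), "\n";   // e6, then e6, then e6bca2e5ad97
}
echo implode(',', $charWise[0]), "\n"; // 漢字
```

The first two byte-wise matches are lone lead bytes, which is exactly the kind of fragment that shows up as garbage on the page.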
Later I tried rewriting the pattern and found an approach that seemingly (why "seemingly"? read on) solves it:
preg_match_all("/(漢|字)+/ism","我是漢字,看你把我怎么著!",$m_a);
Written this way it matches "漢", "字", or "漢字". The result in $m_a:
Array
(
    [0] => Array
        (
            [0] => 漢字
        )

    [1] => Array
        (
            [0] => 字
        )

)
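See, the full matched string shows up! But the celebration was premature: in real use this still went wrong often. I kept digging and finally found the root of the problem. PHP does not support multibyte text, so pattern matching and string operations are all carried out on the encoded bytes (I am not sure this is the right way to put it). A concrete example: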
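eregi_replace("性", "没有", "有责任感");
This call is meant to replace the character "性" in the string "有责任感" with "没有". What is the result? Since "有责任感" does not contain the character "性", no replacement should happen and the call should return "有责任感". But the result is actually "用挥叙任感"!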
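Surprised? Why does this happen? Look at the byte values and it becomes clear. Each Chinese character is stored as 2 bytes (GB2312 encoding), and the byte values of "有责任感" are, in order: 211,208 (有), 212,240 (责), 200,206 (任), 184,208 (感).
The encoding of "性" is 208,212, which is exactly the second byte of 有 followed by the first byte of 责! So PHP considers the pattern found and matches it, splits the two characters in half, and splices the leftover halves together with the replacement string. That is where the corruption comes from.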
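The byte values above can be checked with a short sketch. The GB2312 bytes are written as hex escapes so the script does not depend on the encoding of its own file; the variable names are illustrative.

```php
<?php
// GB2312 bytes of "有责任感" and "性", written as escapes (values taken from the text above).
$subject = "\xD3\xD0\xD4\xF0\xC8\xCE\xB8\xD0";
$needle  = "\xD0\xD4";

// unpack('C*') returns each byte as an unsigned integer.
$byteValues = array_values(unpack('C*', $subject));
echo implode(',', $byteValues), "\n"; // 211,208,212,240,200,206,184,208

// strpos() compares raw bytes, so it reports a hit at byte offset 1.
$offset = strpos($subject, $needle);
var_dump($offset); // int(1)

// Every character in this string is two bytes wide, so an odd offset means
// the "match" starts in the middle of a character.
$startsMidCharacter = ($offset !== false) && ($offset % 2 === 1);
var_dump($startsMidCharacter); // bool(true)
```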
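At the time I assumed the most commonly used function, str_replace(), would surely be fine, but in fact str_replace() fails in the same way on the same operation! Looking back, my earlier Chinese replacements were just lucky. Probably they all involved fairly long Chinese strings, where this situation is less likely to occur. Even if nothing has gone wrong so far, be aware that it is not safe!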
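A minimal sketch reproducing the corruption with str_replace(). eregi_replace() was removed in PHP 7.0, so str_replace() stands in for it here. The GB2312 bytes are written as hex escapes, and the variable names are illustrative.

```php
<?php
// GB2312 byte sequences as hex escapes, so the file's own encoding does not matter.
$subject     = "\xD3\xD0\xD4\xF0\xC8\xCE\xB8\xD0"; // "有责任感"
$search      = "\xD0\xD4";                         // "性"
$replacement = "\xC3\xBB\xD3\xD0";                 // "没有"

// str_replace() works on bytes: it finds D0 D4 straddling 有 and 责 and replaces it.
$result = str_replace($search, $replacement, $subject);

echo bin2hex($subject), "\n"; // d3d0d4f0c8ceb8d0
echo bin2hex($result), "\n";  // d3c3bbd3d0f0c8ceb8d0
```

The output no longer begins with D3 D0 (有): the orphaned D3 has paired up with the first byte of the replacement, and every two-byte pair up to 任 has shifted.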
The problems are real, but the work still has to go on, and the difficulties overcome are what shape who we are now.
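Fortunately I remembered a PHP extension, Multibyte String Functions (mbstring), which adds many functions that support multibyte text; for example, ereg_replace() has a counterpart in mb_ereg_replace(), and so on. For details on the individual functions, consult the relevant documentation.
Conclusion: for safe handling of Chinese text, it is best to use the Multibyte String Functions.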
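One possible way to apply this is sketched below, under several assumptions: the data is GB2312/GBK (CP936 in mbstring), the mbstring extension is loaded, and the helper name gbk_safe_replace is hypothetical. Rather than mb_ereg_replace(), the sketch converts to UTF-8 with mb_convert_encoding(), replaces there, and converts back. In valid UTF-8 a lead byte never looks like a continuation byte, so a byte-level search for a whole character cannot match across a character boundary.

```php
<?php
// Hypothetical helper: replace $search with $replacement in $subject, where all
// three strings are in $encoding (default CP936, i.e. GBK). Requires mbstring.
function gbk_safe_replace(
    string $search,
    string $replacement,
    string $subject,
    string $encoding = 'CP936'
): string {
    // Convert everything to UTF-8, where byte-level replacement is boundary-safe.
    $searchUtf8      = mb_convert_encoding($search, 'UTF-8', $encoding);
    $replacementUtf8 = mb_convert_encoding($replacement, 'UTF-8', $encoding);
    $subjectUtf8     = mb_convert_encoding($subject, 'UTF-8', $encoding);

    $resultUtf8 = str_replace($searchUtf8, $replacementUtf8, $subjectUtf8);

    // Convert back to the original encoding.
    return mb_convert_encoding($resultUtf8, $encoding, 'UTF-8');
}

$subject     = "\xD3\xD0\xD4\xF0\xC8\xCE\xB8\xD0"; // "有责任感"
$search      = "\xD0\xD4";                         // "性"
$replacement = "\xC3\xBB\xD3\xD0";                 // "没有"

// "性" does not occur as a character, so the string comes back untouched.
$unchanged = gbk_safe_replace($search, $replacement, $subject);
var_dump($unchanged === $subject); // bool(true)

// Where "性" really occurs ("任性"), it is replaced as expected.
$changed = gbk_safe_replace($search, $replacement, "\xC8\xCE\xD0\xD4");
echo bin2hex($changed), "\n"; // c8cec3bbd3d0
```

Converting on every call costs some time; if the whole application can store and serve UTF-8, the conversion step can be dropped.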
