Core points
- Although PHP can handle multi-byte variable names and Unicode strings, the language lacks comprehensive Unicode support because it treats strings as sequences of single-byte characters. This limitation affects nearly every string operation, including substring extraction, length calculation, and string splitting.
- Portable UTF-8 is a user space library that brings Unicode support to PHP applications. It is built on top of mbstring and iconv, provides about 60 Unicode-based string manipulation, testing and verification functions, and uses UTF-8 as its main character encoding scheme. The library is fully portable and can be used with any PHP 4.2 or later installation.
- Portable UTF-8 provides functions for processing Unicode strings: validating UTF-8 input, removing invalid bytes, encoding text as HTML entities to help prevent XSS attacks, trimming whitespace, removing duplicate whitespace, creating URL slugs that contain UTF-8 characters, and enforcing limits on input length in characters. In a Unicode-enabled application, the focus shifts from bytes and byte lengths to characters and character lengths.
PHP allows multi-byte variable names (e.g. $a∩b, $Ʃxy and $Δx), mbstring and other extensions can handle Unicode strings, and the utf8_encode() and utf8_decode() functions can convert strings between UTF-8 and ISO-8859-1. Nevertheless, PHP is widely considered to lack Unicode support. This article explains what that lack of support means and demonstrates Portable UTF-8, a library that brings Unicode support to PHP applications.
Unicode support in PHP
PHP's lack of Unicode/multi-byte support means that the standard string functions treat strings as sequences of single-byte characters. In fact, the official PHP manual defines a string as "a series of characters, where a character is the same as a byte". PHP supports only 8-bit characters, while Unicode (and many other character sets) may need several bytes to represent one character. This limitation affects almost every string operation, including (but not limited to) substring extraction, length calculation, splitting, and shuffling.
Efforts to solve this problem began in early 2005, but in 2010 the work on native Unicode support for PHP was stopped and put on hold for a variety of reasons. Since native Unicode support may take years to arrive (if it arrives at all), developers must rely on available extensions such as mbstring and iconv to fill the gap. These extensions are not Unicode-centric (they also convert between non-Unicode encodings), but they do make Unicode string processing simpler.
The extensions also have drawbacks. They offer only limited Unicode string processing, and neither is enabled by default: a server administrator must explicitly enable each one before PHP applications can use it. Shared hosting providers often make things worse by installing only one of the extensions, or neither, which makes it difficult for developers to rely on an always-available API for their Unicode needs.
Still, the good news is that PHP can output Unicode text, because PHP does not care whether it is sending ASCII-encoded English text or text in a language whose characters are encoded in multiple bytes. Knowing this, all PHP developers need is an API that provides convenient Unicode-aware string manipulation.
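The byte-versus-character problem described above can be seen with PHP's built-in functions alone. The snippet below is a minimal sketch that assumes the source file is saved as UTF-8 and that the bundled PCRE extension is available; the variable names are only illustrative.

```php
<?php
// "Δx" is two characters, but three bytes in UTF-8 (Δ is encoded as 0xCE 0x94).
$text = 'Δx';

$byteLength = strlen($text);                   // strlen() counts bytes
$charLength = preg_match_all('/./us', $text);  // in /u mode, "." matches one whole character

// Byte-based substr() cuts the first character in half,
// leaving a lone lead byte that is not valid UTF-8.
$firstByte = substr($text, 0, 1);
$isValid   = preg_match('//u', $firstByte) === 1;
```

Here the byte length and the character length disagree, and the one-byte substring is no longer valid UTF-8, which is exactly the kind of damage a character-aware API is meant to prevent.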
Portable UTF-8
A more recent solution is to create user space libraries written in PHP. Even when the server or the language itself lacks support, such a library can be bundled with the application to guarantee that Unicode support is present. Many open source applications already ship their own libraries of this kind, and many more use free third-party libraries; Portable UTF-8 is one of them. Portable UTF-8 is a free, lightweight library built on top of mbstring and iconv. It extends the functionality of these two extensions, providing about 60 Unicode-based string manipulation, testing and verification functions, with UTF-8-aware counterparts for nearly all of PHP's common string handling functions. As the name implies, Portable UTF-8 uses UTF-8 as its primary character encoding scheme. The library uses the available extensions (mbstring and iconv) for speed and smooths over some inconsistencies in using them directly, but if the extensions are not present on the server, it falls back to UTF-8 routines written in pure PHP. Portable UTF-8 is fully portable and can be used with any PHP 4.2 or later installation.
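The "use the extension when it is there, otherwise fall back to pure PHP" idea can be sketched as follows. This is not Portable UTF-8's actual code; sketch_strlen() is a hypothetical helper, and the fallback assumes the input is already valid UTF-8.

```php
<?php
// Hypothetical helper: count characters, preferring mbstring when it is installed.
function sketch_strlen($string)
{
    if (function_exists('mb_strlen')) {
        // Fast path: the mbstring extension does the counting.
        return mb_strlen($string, 'UTF-8');
    }

    // Pure-PHP fallback: count matches of "any one character" in UTF-8 mode.
    return (int) preg_match_all('/./us', $string);
}
```

Either branch returns the same count for valid UTF-8, so calling code does not need to know which extensions the server provides.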
String processing using Portable UTF-8
Text editors with poor Unicode support can corrupt text when reading it, and text copied from such an editor and pasted into a web form can be a source of invalid UTF-8 for the application. When processing user-submitted input, always make sure the input is exactly what the application expects. To detect whether text is valid UTF-8, use the library's is_utf8() function.
if (is_utf8($_POST['title'])) {
    // do something...
}
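For readers curious how such a check can work without the library, one common pure-PHP idiom relies on PCRE's UTF-8 mode. The sketch below is an assumption about one possible approach, not the library's implementation of is_utf8(); sketch_is_utf8() is a hypothetical name.

```php
<?php
// Hypothetical helper: returns true only when $string is well-formed UTF-8.
function sketch_is_utf8($string)
{
    // In /u mode PCRE validates the subject first; an empty pattern then
    // matches immediately, so the result is 1 only for valid UTF-8.
    return preg_match('//u', $string) === 1;
}
```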
Recovering characters from invalid bytes is impossible, so removing bytes that are not recognized as valid UTF-8 may be your only option. The utf8_clean() function removes invalid bytes.
$title = utf8_clean($_POST['title']);
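Stripping invalid bytes can be sketched in pure PHP by keeping only the byte sequences that form well-formed UTF-8 characters. The function below is a hypothetical illustration of that idea and makes no claim about how utf8_clean() is written.

```php
<?php
// Hypothetical helper: drop every byte that is not part of a well-formed UTF-8 sequence.
function sketch_clean($string)
{
    // Byte-level pattern (no /u flag) listing the well-formed UTF-8 sequences:
    // ASCII, 2-byte, 3-byte (excluding surrogates) and 4-byte (up to U+10FFFF).
    $wellFormed = '/[\x00-\x7F]'
        . '|[\xC2-\xDF][\x80-\xBF]'
        . '|\xE0[\xA0-\xBF][\x80-\xBF]'
        . '|[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}'
        . '|\xED[\x80-\x9F][\x80-\xBF]'
        . '|\xF0[\x90-\xBF][\x80-\xBF]{2}'
        . '|[\xF1-\xF3][\x80-\xBF]{3}'
        . '|\xF4[\x80-\x8F][\x80-\xBF]{2}/';

    // Collect the valid sequences in order and glue them back together.
    preg_match_all($wellFormed, $string, $matches);
    return implode('', $matches[0]);
}
```

Valid input passes through unchanged, while stray bytes such as a lone 0xFF are silently dropped.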
Each Unicode character can be encoded as its corresponding HTML entity, and you may want to encode text this way before outputting it to the browser to help prevent XSS attacks.
echo utf8_html_encode($title);
Whitespace is usually trimmed from the beginning and end of a string. Unicode lists about 20 whitespace characters, and some ASCII control characters should also be treated as candidates for trimming.
$title = utf8_trim($title);
Duplicates of such whitespace characters may also appear in the middle of a string and should be removed. The following shows how to use utf8_remove_duplicates() and utf8_ws() in combination:
$title = utf8_remove_duplicates($title, utf8_ws());
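Both the trimming and the duplicate-removal steps can be approximated with Unicode character properties in PCRE. The helpers below are hypothetical sketches; they assume that "whitespace" means the separator category \p{Z} plus control characters \p{Cc}, which may not match the exact set that utf8_ws() returns.

```php
<?php
// Hypothetical helper: trim Unicode separators and control characters from both ends.
function sketch_trim($string)
{
    return preg_replace('/\A[\p{Z}\p{Cc}]+|[\p{Z}\p{Cc}]+\z/u', '', $string);
}

// Hypothetical helper: collapse a run of the same whitespace character into one copy.
function sketch_collapse_spaces($string)
{
    return preg_replace('/([\p{Z}\p{Cc}])\1+/u', '$1', $string);
}
```

A no-break space (U+00A0) or an em space (U+2003) is handled the same way as an ordinary space here, which the byte-based trim() would not do by default.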
The traditional way to create URL slugs for SEO purposes is to transliterate the text and remove all non-ASCII characters. This makes the URL less valuable than it could be. Since URLs can carry UTF-8 encoded characters, no such removal or transliteration is needed, and we can create rich slugs containing characters in any language:
$slug = utf8_url_slug($title, 30); // limit of 30 characters
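A slug that keeps non-ASCII letters can be sketched with Unicode character classes. The function below is a hypothetical illustration, not the library's utf8_url_slug(); it assumes valid UTF-8 input and does not lowercase the text.

```php
<?php
// Hypothetical helper: build a slug that keeps letters, marks and digits in any script.
function sketch_url_slug($string, $maxLength)
{
    // Replace each run of anything that is not a letter, combining mark or number with one hyphen.
    $slug = preg_replace('/[^\p{L}\p{M}\p{N}]+/u', '-', $string);
    $slug = trim($slug, '-');

    // Keep at most $maxLength characters (not bytes).
    preg_match('/\A.{0,' . (int) $maxLength . '}/us', $slug, $match);

    // Truncation may leave a trailing hyphen; remove it.
    return rtrim($match[0], '-');
}
```

For example, the Russian title "Привет, мир!" keeps its Cyrillic letters and only the punctuation and spaces turn into hyphens.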
From input validation through to saving data in a database, Unicode-enabled applications focus on characters and character lengths, not bytes and byte lengths. This shift in focus requires an interface that understands the difference. It is often necessary to limit the length of input in characters, so if the input is longer than 60 characters, we take a substring.
if (utf8_strlen($title) > 60) { $title = utf8_substr($title, 0, 60); }
Or:
if (!utf8_fits_inside($title, 60)) { $title = utf8_substr($title, 0, 60); }
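To show what "60 characters" means as opposed to "60 bytes", here is a minimal pure-PHP sketch of a character-based length limit. sketch_limit() is a hypothetical helper, and returning an empty string for invalid UTF-8 is a design assumption made for this example.

```php
<?php
// Hypothetical helper: return at most $maxChars characters from the start of $string.
function sketch_limit($string, $maxChars)
{
    // In /u mode "." matches one whole UTF-8 character; /s lets it match newlines too.
    if (preg_match('/\A.{0,' . (int) $maxChars . '}/us', $string, $match) === 1) {
        return $match[0];
    }

    // preg_match() fails on invalid UTF-8, so there is nothing safe to return.
    return '';
}
```

The result never ends in the middle of a multi-byte character, even though its byte length can be several times the character limit.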
The Portable UTF-8 library offers three different ways to access individual characters. We can use utf8_access() to read a single character by position.
echo 'The sixth character is: ' . utf8_access($string, 5);
utf8_chr_map() allows iterating over individual characters using a callback function.
utf8_chr_map('some_callback', $string);
We can split the string into an array of characters using utf8_split() and process each array element as a single character.
array_map('some_callback', utf8_split($string));
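The same split-then-process pattern can be tried without the library by using preg_split() in UTF-8 mode. This is a sketch under the assumption that the input is valid UTF-8 and the file is saved as UTF-8; the variable names are only illustrative.

```php
<?php
$string = 'Δx = 日本語';

// Split between every character; PREG_SPLIT_NO_EMPTY drops the empty edge pieces.
$chars = preg_split('//u', $string, -1, PREG_SPLIT_NO_EMPTY);

// Positional access: index 5 is the sixth character.
$sixth = $chars[5];

// Per-character processing with a callback: here, the byte width of each character.
$widths = array_map('strlen', $chars);
```

Each array element is a complete character, so one element may be one, two or three bytes long in this example.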
Working with Unicode may also require finding the minimum/maximum code point in a string, splitting strings, handling byte order marks, converting case, randomizing/shuffling, replacing, and so on. All of this is supported by Portable UTF-8.
Conclusion
PHP 6 development has been stopped, which delays the long-awaited native Unicode support that is crucial for developing multilingual applications. In the meantime, server-side extensions and user space libraries such as Portable UTF-8 play an important role in helping developers build a better, standardized web that meets local needs.