Logo Advies en zo, Meedenken en -doen

Portable php-mysql connection charset fix

The problem

I was building a website with a PHP MySQL backend which was serving the webpages in the utf-8 character set to the browser.
Following all the UTF-8 guidelines which you can find anywhere on the web, I had already made sure that:

Unfortunately my string Itrntinliztin still came up as I?t?rn?ti?n?liz?ti?n.

The need for a portable solution

Even though I knew that the website I was working on would be using utf-8, in other websites I might well be working with other character sets. As the basis of my back-end was set up to be portable and re-usable, I would need a portable solution to this problem.

The assumption I make here is that the web output of the application will always use the same character set as the character set I use in the database. So we presume that the database has been set up properly using the character set we want to use for the whole application.

What went wrong ?

I found the cause of my problem - after doing many hours of reading - to be a default PHP setting:

A PHP mysql connection (...) defaults to a latin1 connection, so, your first query after connection should be:

mysql_query("SET NAMES 'utf8'");

From the PHP WACT pages on utf-8 and mysql

So now it was time to make sure that this was really the cause of the problem and I started testing the values for charset and collation I would get with a default php-mysql connection.

Original MySQL variables

mysql> SHOW VARIABLES LIKE 'c%';

+-------------------------------+-------------------------------+
| Variable_name			| Value				|
+-------------------------------+-------------------------------+
| character_set_client		| latin1			|
| character_set_connection	| latin1			|
| character_set_database	| utf8				|
| character_set_results		| latin1			|
| character_set_server		| latin1			|
| character_set_system		| utf8				|
| collation_connection		| latin1_swedish_ci		|
| collation_database		| utf8_general_ci		|
| collation_server		| latin1_general_ci		|
+-------------------------------+-------------------------------+

The bold values are all "wrong" or may be I should say "undesired". The database character set and the database collation values (in blue) are the ones which should be leading.

Explanation of the variables

character_set_server and collation_server

The server character set and collation can be determined from the values of the character_set_server and collation_server system variables.

Ok, those are the defaults of the server, but we already overruled those when setting up the database, so for this exercise they are not really relevant.

character_set_database and collation_database

The character set and collation of the default database can be determined from the values of the character_set_database and collation_database system variables.

Those are the defaults of the database we selected. These are the ones we want to use.

character_set_client

What character set is the statement in when it leaves the client?

The server takes the character_set_client system variable to be the character set in which statements are sent by the client.

The client in this case is the PHP script. As the sql statement which is send by the PHP script to the server may contain utf-8 characters - for instance form input which (after validation) is used in an sql query -, the client character set needs to use the same character set as the database.

character_set_connection and collation_connection

What character set should the server translate a statement to after receiving it?

For this, the server uses the character_set_connection and collation_connection system variables. It converts statements sent by the client from character_set_client to character_set_connection (except for string literals that have an introducer such as _latin1 or _utf8).

collation_connection is important for comparisons of literal strings. For comparisons of strings with column values, collation_connection does not matter because columns have their own collation, which has a higher collation precedence.

If we would need sql statements to be translated between character sets when sending them from PHP to MySQL, this is where we would need to make any changes. However, for a stable application, using the same character set in the database as in the web output of an application is advisable.

This means for all practical purposes that the connection character set variables need to use the same character set as the database.

character_set_results

What character set should the server translate to before shipping result sets or error messages back to the client?

The character_set_results system variable indicates the character set in which the server returns query results to the client. This includes result data such as column values, and result metadata such as column names.

The results sets (for instance from a SELECT statement) will normally be used by the web application for display on the screen, so - again - for all practical purposes this should be set to the same character set as the database, i.e. the same character set which is used throughout the application.

Quotes taken from the MySQL manual page on Connection charsets and collations

Word of warning on using names of character sets

For using the charset in PHP and sending the charset to the client's browser we would use utf-8 (i.e. for instance for use in the htmlspecialchars function or in the header as shown above).

For setting the charset in MySQL we need to set it to utf8 (take note: no dash).

So even if we have a variable defined which holds the default charset for the application to use throughout the code, we cannot use this to send a change to MySQL as it has it's own naming convention for charsets.

For more information on the character set names used in MySQL, please refer to the MySQL Manual on Charsets and Collations.

Testing Stuart's method

Ok, so now I knew the cause, I could start looking for a solution.

For this database I was in a position to change the mysql defaults, but doing it that way would not create a portable solution as I knew very well I couldn't touch the mysql settings on any of the shared servers which a lot of my customers use.

Stuart Herbert turned out to have been working on a similar problem involving conflicting collations, so I first tried out his proposed solution:

mysql_query("set collation_connection = @@collation_database");

Resulting MySQL variables after changing the collation using this method

mysql> SHOW VARIABLES LIKE 'c%';

+-------------------------------+-------------------------------+
| Variable_name			| Value				|
+-------------------------------+-------------------------------+
| character_set_client		| latin1			|
| character_set_connection	| utf8				|
| character_set_database	| utf8				|
| character_set_results		| latin1			|
| character_set_server		| latin1			|
| character_set_system		| utf8				|
| collation_connection		| utf8_general_ci		|
| collation_database		| utf8_general_ci		|
| collation_server		| latin1_general_ci		|
+-------------------------------+-------------------------------+

Better, but not quite what I wanted.... and my text still came up as I?t?rn?ti?n?liz?ti?n.

Creating a portable solution

So, I would need to use SET NAMES after all, but, to use SET NAMES, I would need to provide it with a character set name.

What does SET NAMES do ?

A SET NAMES 'x' statement is equivalent to these three statements:

SET character_set_client = x;
SET character_set_results = x;
SET character_set_connection = x;

Setting character_set_connection to x also sets collation_connection to the default collation for x. To specify one of the character set's collations explicitly, use the optional COLLATE clause:

SET NAMES 'charset_name' COLLATE 'collation_name'

From the MySQL manual page on Connection charsets and collations

The solution

Tests using a statement such as the following examples inspired by Stuart Herbert gave error messages as the sql statement didn't seem to interpret the @@character_set_database properly.

SET NAMES @@character_set_database
SET NAMES '@@character_set_database'

Through the SHOW VARIABLES statement I could access the value of character_set_database, even though it would be a roundabout way - getting the variable and then passing it back -, it would make for a portable solution. So the code needed was:

$db_charset = mysql_query( "SHOW VARIABLES LIKE 'character_set_database'" );
$charset_row = mysql_fetch_assoc( $db_charset );
mysql_query( "SET NAMES '" . $charset_row['Value'] . "'" );
unset( $db_charset, $charset_row );

Let's see what that would get us..

Resulting MySQL variables after changing the character set through Set Names

mysql> SHOW VARIABLES LIKE 'c%';

+-------------------------------+-------------------------------+
| Variable_name			| Value				|
+-------------------------------+-------------------------------+
| character_set_client		| utf8				|
| character_set_connection	| utf8				|
| character_set_database	| utf8				|
| character_set_results		| utf8				|
| character_set_server		| latin1			|
| character_set_system		| utf8				|
| collation_connection		| utf8_general_ci		|
| collation_database		| utf8_general_ci		|
| collation_server		| latin1_general_ci		|
+-------------------------------+-------------------------------+

Yup ! Result achieved and they lived happily ever after...

Code Examples using various standard database management classes

Using standard PHP MySQL functions

$db = mysql_connect( $dbhost, $dbuser, $dbpasswd );
mysql_select_db( $dbname, $db);

// Make sure any results we retrieve or commands we send use the same charset and collation as the database:
$db_charset = mysql_query( "SHOW VARIABLES LIKE 'character_set_database'" );
$charset_row = mysql_fetch_assoc( $db_charset );
mysql_query( "SET NAMES '" . $charset_row['Value'] . "'" );
unset( $db_charset, $charset_row );

Using ADODB

$db->Connect( $dbhost, $dbuser, $dbpasswd, $dbname );

// Make sure any results we retrieve or commands we send use the same charset and collation as the database:
$db_charset = $db->Execute( "SHOW VARIABLES LIKE 'character_set_database'" );
$db->Execute( "SET NAMES '" . $db_charset->fields['Value'] . "'" )
$db_charset->Close(); // Optional

PHP MySQL compliancy

I have not exhaustively tested this fix in all possible php and mysql versions. However as it uses pretty standard code, it should work fine in all versions which support UTF-8, i.e. MySQL v 4.1.0 and higher.

If you have tested this fix in other PHP / MySQL versions or even using different database systems if the problem would occur in those too, please let me know and I will start making a list.

If you use another standard database connection class and use this fix, please send me a code sample and I'll add it to this page.

Known bugs as of June 18nd 2006

None so far.

Further reading:

Contact me:

For bug reports, compliance reports or raving enthousiastic thank-you mails, you can contact me by e-mail:

Valid XHTML 1.0 Transitional code ! Correct CSS code !