September 9th, 2011, 12:12 PM
Recovering wrongly encoded database entries
I was not really sure where to put my thread, but I figured this forum would be the best choice.
I received a project from a customer. It's a website in 3 languages (english, russian, german). The customer was using a TYPO3 system to manage his website. Unfortunately the TYPO3 system hasn't been setup correctly.
Normally I would expect a complete UTF-8 setup for such a cross language site. But the database (MySQL) tables use "latin1_swedish_ci" collation - which is not a good thing if you want to use cyrillic... Furthermore the TYPO3 system has been configured to enforce UTF-8 format.
Which results in a very strange encoding.
Here is an example:
russian (cyrillic) - frontend:
db entry via phpmyadmin:
Посетив нашу столицу, не удивляйтесь, обнаружив, что Ереван такой же древний, как и Рим “Вечный Город”. Удостоверится в этом, вы сможете, посетив Музей Эребуни города-крепости, где и было поставлено начало нашей столицы. Столица Республики Армения один из самых древних и красивых городов мира. Ереван – сокровищница множества архитектурных памятников. Вот лишь некоторые из них. Эти достопримечательности мы рекомендуем вам осмотреть прежде всего. Приглашаем вас на увлекательную прогулку по Еревану.
ÐŸÐ¾Ñ?ÐµÑ‚Ð¸Ð² Ð½Ð°ÑˆÑƒ Ñ?Ñ‚Ð¾Ð»Ð¸Ñ†Ñƒ, Ð½Ðµ ÑƒÐ´Ð¸Ð²Ð»Ñ?Ð¹Ñ‚ÐµÑ?ÑŒ, Ð¾Ð±Ð½Ð°Ñ€ÑƒÐ¶Ð¸Ð², Ñ‡Ñ‚Ð¾ Ð•Ñ€ÐµÐ²Ð°Ð½ Ñ‚Ð°ÐºÐ¾Ð¹ Ð¶Ðµ Ð´Ñ€ÐµÐ²Ð½Ð¸Ð¹, ÐºÐ°Ðº Ð¸ Ð Ð¸Ð¼ “Ð’ÐµÑ‡Ð½Ñ‹Ð¹ Ð“Ð¾Ñ€Ð¾Ð´â€?. Ð£Ð´Ð¾Ñ?Ñ‚Ð¾Ð²ÐµÑ€Ð¸Ñ‚Ñ?Ñ? Ð² Ñ?Ñ‚Ð¾Ð¼, Ð²Ñ‹ Ñ?Ð¼Ð¾Ð¶ÐµÑ‚Ðµ, Ð¿Ð¾Ñ?ÐµÑ‚Ð¸Ð² ÐœÑƒÐ·ÐµÐ¹ Ð*Ñ€ÐµÐ±ÑƒÐ½Ð¸ Ð³Ð¾Ñ€Ð¾Ð´Ð°-ÐºÑ€ÐµÐ¿Ð¾Ñ?Ñ‚Ð¸, Ð³Ð´Ðµ Ð¸ Ð±Ñ‹Ð»Ð¾ Ð¿Ð¾Ñ?Ñ‚Ð°Ð²Ð»ÐµÐ½Ð¾ Ð½Ð°Ñ‡Ð°Ð»Ð¾ Ð½Ð°ÑˆÐµÐ¹ Ñ?Ñ‚Ð¾Ð»Ð¸Ñ†Ñ‹. Ð¡Ñ‚Ð¾Ð»Ð¸Ñ†Ð° Ð ÐµÑ?Ð¿ÑƒÐ±Ð»Ð¸ÐºÐ¸ Ð?Ñ€Ð¼ÐµÐ½Ð¸Ñ? Ð¾Ð´Ð¸Ð½ Ð¸Ð· Ñ?Ð°Ð¼Ñ‹Ñ… Ð´Ñ€ÐµÐ²Ð½Ð¸Ñ… Ð¸ ÐºÑ€Ð°Ñ?Ð¸Ð²Ñ‹Ñ… Ð³Ð¾Ñ€Ð¾Ð´Ð¾Ð² Ð¼Ð¸Ñ€Ð°. Ð•Ñ€ÐµÐ²Ð°Ð½ – Ñ?Ð¾ÐºÑ€Ð¾Ð²Ð¸Ñ‰Ð½Ð¸Ñ†Ð° Ð¼Ð½Ð¾Ð¶ÐµÑ?Ñ‚Ð²Ð° Ð°Ñ€Ñ…Ð¸Ñ‚ÐµÐºÑ‚ÑƒÑ€Ð½Ñ‹Ñ… Ð¿Ð°Ð¼Ñ?Ñ‚Ð½Ð¸ÐºÐ¾Ð². Ð’Ð¾Ñ‚ Ð»Ð¸ÑˆÑŒ Ð½ÐµÐºÐ¾Ñ‚Ð¾Ñ€Ñ‹Ðµ Ð¸Ð· Ð½Ð¸Ñ…. Ð*Ñ‚Ð¸ Ð´Ð¾Ñ?Ñ‚Ð¾Ð¿Ñ€Ð¸Ð¼ÐµÑ‡Ð°Ñ‚ÐµÐ»ÑŒÐ½Ð¾Ñ?Ñ‚Ð¸ Ð¼Ñ‹ Ñ€ÐµÐºÐ¾Ð¼ÐµÐ½Ð´ÑƒÐµÐ¼ Ð²Ð°Ð¼ Ð¾Ñ?Ð¼Ð¾Ñ‚Ñ€ÐµÑ‚ÑŒ Ð¿Ñ€ÐµÐ¶Ð´Ðµ Ð²Ñ?ÐµÐ³Ð¾. ÐŸÑ€Ð¸Ð³Ð»Ð°ÑˆÐ°ÐµÐ¼ Ð²Ð°Ñ? Ð½Ð° ÑƒÐ²Ð»ÐµÐºÐ°Ñ‚ÐµÐ»ÑŒÐ½ÑƒÑŽ Ð¿Ñ€Ð¾Ð³ÑƒÐ»ÐºÑƒ Ð¿Ð¾ Ð•Ñ€ÐµÐ²Ð°Ð½Ñƒ.
I really don't have a clue how to migrate this content to a new system running a UTF-8 database...
Does anyone have an idea if or how you could recover this?
thanks a lot in advance and kind regards
September 9th, 2011, 04:00 PM
Yes, recovery should be possible. Is your new database MySQL also? Does it have the same schema (besides the character encoding)?
September 9th, 2011, 08:53 PM
yes, the new db is also a mysql database.
And the schema is also remaining the same. Some slight updates will be done to some of the tables, as there will be a new TYPO3 version in place. But this can also be done after the encoding problem is resolved.
The tables collations are set to utf8_general_ci at the moment.
What can I do to convert it?
And thanks in advance!
September 15th, 2011, 04:34 AM
did you have a chance to look at that problem?
I am still stuck on it. :/
September 15th, 2011, 10:49 PM
This is a difficult problem to analyze abstractly because there are a lot of different stages at which conversion could occur.
I'm assuming the actual data itself is encoded in UTF-8 already, and the database is simply pretending that the data is latin1 encoded when it isn't. So ideally I think you would want to read the data out of the existing database with a latin1 connection, and then insert it into the new database with a UTF-8 connection without performing any actual conversion on it.
I assume the front-end on your old system delivers data as UTF-8 to the browser, which means that at some point the content from the database is either already UTF-8 or is being converted to UTF-8 before being sent to the browser. You need to do exactly the same then, except instead of sending it to the browser, you send it to the new database.
September 16th, 2011, 02:40 AM
thanks for your reply. After about 6 hours of investigating and trial & error on this matter, I was finally able to migrate the data to a new system.
As you already assumed the data in the database was utf8 encoded. Running utf8_decode via php (eg) resulted in a proper output.
However I realized when doing a mysqldump on the server with characters set to latin1 that the dump file contained all foreign characters in the correct format. So what I did was to search & replace all instances of "latin1" and replace them with utf8. After that I imported the dump on a fresh, empty database and it worked.
I am not exactly sure why it worked yesterday evening, because I already tried that twice the other day, but well... at least it works.