#1
  1. No Profile Picture
    Contributing User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Sep 2002
    Location
    Germany
    Posts
    229
    Rep Power
    13

    Recovering wrongly encoded database entries


    Hi all,

    I was not really sure where to put my thread, but I figured this forum would be the best choice.

    I received a project from a customer. It's a website in 3 languages (english, russian, german). The customer was using a TYPO3 system to manage his website. Unfortunately the TYPO3 system hasn't been setup correctly.

    Normally I would expect a complete UTF-8 setup for such a cross language site. But the database (MySQL) tables use "latin1_swedish_ci" collation - which is not a good thing if you want to use cyrillic... Furthermore the TYPO3 system has been configured to enforce UTF-8 format.

    Which results in a very strange encoding.

    Here is an example:

    russian (cyrillic) - frontend:

    Code:
    Посетив нашу столицу, не удивляйтесь, обнаружив, что Ереван такой же древний, как и Рим Вечный Город. Удостоверится в этом, вы сможете, посетив Музей Эребуни города-крепости, где и было поставлено начало нашей столицы. Столица Республики Армения один из самых древних и красивых городов мира. Ереван  сокровищница множества архитектурных памятников. Вот лишь некоторые из них. Эти достопримечательности мы рекомендуем вам осмотреть прежде всего. Приглашаем вас на увлекательную прогулку по Еревану.
    db entry via phpmyadmin:

    Code:
    По?етив нашу ?толицу, не удивл?йте?ь, обнаружив, что Ереван такой же древний, как и  им Вечный Город?. Удо?товерит?? в ?том, вы ?можете, по?етив Музей *ребуни города-крепо?ти, где и было по?тавлено начало нашей ?толицы. Столица  е?публики ?рмени? один из ?амых древних и кра?ивых городов мира. Ереван  ?окровищница множе?тва архитектурных пам?тников. Вот лишь некоторые из них. *ти до?топримечательно?ти мы рекомендуем вам о?мотреть прежде в?его. Приглашаем ва? на увлекательную прогулку по Еревану.

    I really don't have a clue how to migrate this content to a new system running a UTF-8 database...

    Does anyone have an idea if or how you could recover this?

    thanks a lot in advance and kind regards

    Sven
  2. #2
  3. No Profile Picture
    Lost in code
    Devshed Supreme Being (6500+ posts)

    Join Date
    Dec 2004
    Posts
    8,316
    Rep Power
    7170
    Yes, recovery should be possible. Is your new database MySQL also? Does it have the same schema (besides the character encoding)?
    PHP FAQ

    Originally Posted by Spad
    Ah USB, the only rectangular connector where you have to make 3 attempts before you get it the right way around
  4. #3
  5. No Profile Picture
    Contributing User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Sep 2002
    Location
    Germany
    Posts
    229
    Rep Power
    13
    yes, the new db is also a mysql database.

    And the schema is also remaining the same. Some slight updates will be done to some of the tables, as there will be a new TYPO3 version in place. But this can also be done after the encoding problem is resolved.

    The tables collations are set to utf8_general_ci at the moment.

    What can I do to convert it?

    And thanks in advance!
  6. #4
  7. No Profile Picture
    Contributing User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Sep 2002
    Location
    Germany
    Posts
    229
    Rep Power
    13
    did you have a chance to look at that problem?

    I am still stuck on it. :/
  8. #5
  9. No Profile Picture
    Lost in code
    Devshed Supreme Being (6500+ posts)

    Join Date
    Dec 2004
    Posts
    8,316
    Rep Power
    7170
    This is a difficult problem to analyze abstractly because there are a lot of different stages at which conversion could occur.

    I'm assuming the actual data itself is encoded in UTF-8 already, and the database is simply pretending that the data is latin1 encoded when it isn't. So ideally I think you would want to read the data out of the existing database with a latin1 connection, and then insert it into the new database with a UTF-8 connection without performing any actual conversion on it.

    I assume the front-end on your old system delivers data as UTF-8 to the browser, which means that at some point the content from the database is either already UTF-8 or is being converted to UTF-8 before being sent to the browser. You need to do exactly the same then, except instead of sending it to the browser, you send it to the new database.
    PHP FAQ

    Originally Posted by Spad
    Ah USB, the only rectangular connector where you have to make 3 attempts before you get it the right way around
  10. #6
  11. No Profile Picture
    Contributing User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Sep 2002
    Location
    Germany
    Posts
    229
    Rep Power
    13
    thanks for your reply. After about 6 hours of investigating and trial & error on this matter, I was finally able to migrate the data to a new system.

    As you already assumed the data in the database was utf8 encoded. Running utf8_decode via php (eg) resulted in a proper output.

    However I realized when doing a mysqldump on the server with characters set to latin1 that the dump file contained all foreign characters in the correct format. So what I did was to search & replace all instances of "latin1" and replace them with utf8. After that I imported the dump on a fresh, empty database and it worked.

    I am not exactly sure why it worked yesterday evening, because I already tried that twice the other day, but well... at least it works.

IMN logo majestic logo threadwatch logo seochat tools logo