How to Decode and Correctly Display UTF-8 and ISO Encoded Text

If you have ever worked with email or NNTP (news) messages, you'll know that the From: and Subject: headers don't always contain human-readable text. For example:

=?ISO-8859-9?Q?Yeni_Tak=FDm_Arkada=FElar=FD_Ar=FDyoruz?=
=?Utf-8?B?VGFyaXF1ZQ==?= =?Utf-8?B?2KfYsdiv2Yg=?=
=?koi8-r?B?+sHLwdog19DV08vOz8fPIMvPzMzFy9TP0sEgzsEg0MXWzyAzMDc=?=

The gibberish above is encoded for transmission over systems that can only deal with US-ASCII characters. What is needed is a conversion of the coded text into Unicode so that it can be displayed in its original form, like this:

Yeni Takım Arkadaşları Arıyoruz
Tariqueاردو
Заказ впускного коллектора на пежо 307

Disclaimer: The coded text used in this article was taken from various email messages that may or may not have been spam. kadaitcha.cx is not responsible for any rude bits because the author can't read or write in anything other than English.

Walkthrough

Break Apart

There are several problems to solve when trying to decode UTF/ISO encoded text. The first problem is determining what to decode. Often string sequences contain multiple sets of encodings, which are signified by multiple copies of "?= =?" delimiters in the string. You cannot simply pass the coded text into a decoder and have human-readable text pop out. Each set of text needs to be stripped out of its delimiters before decoding.

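To make matters worse, the text may be Base64 encoded (marked by ?B?) or it may contain individual code points (marked by ?Q?), each written as an equals sign followed by a two-digit hexadecimal value. Strictly speaking, each such value is one byte of the text in the named character set, so a single character may need more than one of them. For example, the continuous line of encoded text below contains three sets of UTF-8 delimiters and several code points, such as =C3=B6: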

=?Utf-8?Q?RE:_Hogyan_lehet_az_elm=C3=BAlt_pl._3?==?Utf-8?Q?_h=C3=B3nap_bejegyz=C3=A9seit_kit=C3=B6r=C3=B6lni_Ou?==?Utf-8?Q?tlookb?=

The above text needs to be broken down into sets of encoded text, like this:

=?Utf-8?Q?RE:_Hogyan_lehet_az_elm=C3=BAlt_pl._3?=

=?Utf-8?Q?_h=C3=B3nap_bejegyz=C3=A9seit_kit=C3=B6r=C3=B6lni_Ou?=

=?Utf-8?Q?tlookb?=
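
Each piece above follows the same shape: an opening "=?", the character set name, a "?B?" or "?Q?" marker, the payload, and a closing "?=". As a rough illustration only, and not the approach taken in this article's class, the sketch below uses a regular expression to pull those parts out of the example line. The module name, the pattern, and the FindEncodedWords helper are assumptions made for this illustration.

```vb
Imports System
Imports System.Text.RegularExpressions

Module EncodedWordAnatomy

    ' Assumed shape of one encoded section: =?charset?B-or-Q?payload?=
    ' The payload group excludes "?" and white space, so a payload that
    ' starts with "=" (as in "?Q?=C3...") is still matched correctly.
    Private ReadOnly WordPattern As New Regex( _
        "=\?([^?\s]+)\?([bBqQ])\?([^?\s]*)\?=")

    ' Hypothetical helper: finds every encoded section in a header value
    Public Function FindEncodedWords(ByVal HeaderText As String) As MatchCollection
        Return WordPattern.Matches(HeaderText)
    End Function

    Sub Main()
        Dim Sample As String = _
            "=?Utf-8?Q?RE:_Hogyan_lehet_az_elm=C3=BAlt_pl._3?=" & _
            "=?Utf-8?Q?_h=C3=B3nap_bejegyz=C3=A9seit_kit=C3=B6r=C3=B6lni_Ou?=" & _
            "=?Utf-8?Q?tlookb?="

        ' Print the three parts of each encoded section
        For Each m As Match In FindEncodedWords(Sample)
            Console.WriteLine("Charset:  " & m.Groups(1).Value)
            Console.WriteLine("Encoding: " & m.Groups(2).Value)
            Console.WriteLine("Payload:  " & m.Groups(3).Value)
            Console.WriteLine()
        Next
    End Sub

End Module
```

For the sample line this finds three sections, each with the character set "Utf-8" and the encoding "Q".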

The code below breaks the input text into individual sets of encodings as shown above. It also builds a template that is used to reconstruct the decoded text in the correct order and format. The template for the example above is similar to "{0}{1}{2}". In most cases the template is superfluous, but a header can mix encoded sets with plain text typed on a standard keyboard, and the template keeps that plain text in its original position.

Imports System.Text
Imports System.Text.RegularExpressions

Public Class cHeaderDecoder

    Private Const _empty As String = "" ' A blank string, Char type won't work.
    Private Const _braceleft As Char = "{"
    Private Const _braceright As Char = "}"
    Private Const _question As Char = "?"
    Private Const _underscore As Char = "_"
    Private Const _space As Char = " "
    Private Const L_MARK_LEFT As String = "=?"
    Private Const L_MARK_RIGHT As String = "?="
    Private Const L_MARK_CODE As String = "?q?="

    Private Function BreakApartCodedText(ByVal TokenisedText As String, _
                                         ByRef ParameterText As String) _
                                                        As List(Of String)

        If (Not TokenisedText.Contains(L_MARK_LEFT)) OrElse _
                  (Not TokenisedText.Contains(L_MARK_RIGHT)) Then
            Return Nothing
        End If

        Dim iLeftTokenStartCut As Integer
        Dim iRightTokenStartCut As Integer
        Dim CutLength As Integer
        Dim WorkingText As New StringBuilder
        Dim ExtractedToken As String
        Dim ExtractedTokens As New List(Of String)
        Dim iZ As Integer

        ' Some email clients insert a superfluous space, remove it:
        WorkingText.Append(TokenisedText.Replace("?= =?", "?==?"))

        Do While True
            ' Find the starting token on the left
            iLeftTokenStartCut = WorkingText.ToString.IndexOf(L_MARK_LEFT)

            ' Look for the closing "?=" only after this token's own "?Q?" or "?B?"
            ' marker. When the payload starts with a code point the text contains
            ' "?q?=", and the last two characters of that would otherwise be
            ' mistaken for L_MARK_RIGHT. Anchoring the search to the current
            ' token also stops a "?q?=" in a later token from being picked up:
            Dim Snapshot As String = WorkingText.ToString
            Dim NewStartPoint As Integer = Snapshot.IndexOf(_question, _
                             iLeftTokenStartCut + L_MARK_LEFT.Length)
            If NewStartPoint = -1 OrElse NewStartPoint + 3 > Snapshot.Length Then
                ' No encoding marker; fall back to searching from the left token
                NewStartPoint = iLeftTokenStartCut
            Else
                NewStartPoint += 3 ' Step over "?Q?" or "?B?"
            End If
            ' Decide where to cut the text
            iRightTokenStartCut = Snapshot.IndexOf(L_MARK_RIGHT, NewStartPoint)
            If iRightTokenStartCut = -1 Then
                ' Malformed input: nothing closes this token. Stop here, and
                ' report "nothing to decode" if no token was extracted at all
                If ExtractedTokens.Count = 0 Then Return Nothing
                Exit Do
            End If
            iRightTokenStartCut += L_MARK_RIGHT.Length
            ' Decide how much to cut
            CutLength = ((iRightTokenStartCut + L_MARK_RIGHT.Length) - _
                             iLeftTokenStartCut) - L_MARK_RIGHT.Length
            ' Extract the tokenised text
            ExtractedToken = WorkingText.ToString.Substring(iLeftTokenStartCut, _
                                                                     CutLength)
            ' Add the extracted text to our output list
            ExtractedTokens.Add(ExtractedToken)
            ' Remove the text from our working copy so that we don't
            ' get stuck in an infinite loop:
            WorkingText.Remove(iLeftTokenStartCut, CutLength)
            ' Insert a mask for the template, eg {0}, to take the place
            ' of the extracted text
            WorkingText.Insert(iLeftTokenStartCut, _braceleft & iZ.ToString & _
                                                                 _braceright)
            iZ += 1

            ' Exit the loop if there are no more markers
            If (Not WorkingText.ToString.Contains(L_MARK_LEFT)) OrElse _
                             (Not WorkingText.ToString.Contains(L_MARK_RIGHT)) Then
                Exit Do
            End If

        Loop

        ' Return the template and the extracted tokens
        ParameterText = WorkingText.ToString
        Return ExtractedTokens

    End Function

End Class
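
To make the role of the template easier to see, here is a minimal sketch of the same idea written with a regular expression instead of IndexOf arithmetic. It is an alternative illustration, not the code from the sample project: the module, the pattern, and the SplitEncodedWords name are hypothetical, and unlike the class above it does not remove the space between "?= =?" pairs.

```vb
Imports System
Imports System.Collections.Generic
Imports System.Text
Imports System.Text.RegularExpressions

Module TemplateSketch

    ' Assumed shape of one encoded section: =?charset?B-or-Q?payload?=
    Private ReadOnly WordPattern As New Regex("=\?[^?\s]+\?[bBqQ]\?[^?\s]*\?=")

    ' Returns the encoded sections in order. Template receives the input
    ' with each section replaced by {0}, {1}, and so on.
    Public Function SplitEncodedWords(ByVal HeaderText As String, _
                                      ByRef Template As String) As List(Of String)
        Dim Words As New List(Of String)
        Dim Builder As New StringBuilder
        Dim Position As Integer = 0

        For Each m As Match In WordPattern.Matches(HeaderText)
            ' Copy any plain text that precedes this encoded section
            Builder.Append(HeaderText.Substring(Position, m.Index - Position))
            ' Put a numbered mask where the encoded section used to be
            Builder.Append("{" & Words.Count.ToString & "}")
            Words.Add(m.Value)
            Position = m.Index + m.Length
        Next
        ' Copy whatever plain text follows the last encoded section
        Builder.Append(HeaderText.Substring(Position))

        Template = Builder.ToString
        Return Words
    End Function

    Sub Main()
        Dim Template As String = Nothing
        Dim Words As List(Of String) = SplitEncodedWords( _
            "Re: =?Utf-8?Q?elm=C3=BAlt?==?Utf-8?Q?_h=C3=B3nap?= (fwd)", Template)

        Console.WriteLine(Template)   ' Re: {0}{1} (fwd)
        For Each Word As String In Words
            Console.WriteLine(Word)
        Next
    End Sub

End Module
```

The plain text "Re: " and " (fwd)" survives in the template, which is the case the template exists to handle.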

Decode the Text

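The next phase of the process is to decode, or rather re-encode, the text into Unicode. Base64 text is decoded directly using the named character set. Text that contains individual code points is first expanded to one character per code point and then passed through Western European (ISO-8859-1), which maps each of those characters back to a single byte so that the bytes can be interpreted in the named character set: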

Private Function DecodeToUnicode(ByVal sEncodedText As String, _
                                     ByRef EncodingName As String) As String

    Dim EncodedText As New StringBuilder
    Dim iStart As Integer ' Index
    Dim DecodedText As String
    Dim IsCodePointText As Boolean

    If sEncodedText.ToString.ToLower.IndexOf(L_MARK_LEFT) = -1 Then
        ' Nothing to do
        Return sEncodedText
    End If

    EncodedText.Append(sEncodedText)
    ' Extract the preamble
    iStart = EncodedText.ToString.ToLower.IndexOf(L_MARK_LEFT)
    If iStart = 0 Then
        ' Remove the first encoding name marker
        EncodedText.Remove(iStart, 2)
        iStart = EncodedText.ToString.ToLower.IndexOf(_question)
        If iStart <> -1 Then
            ' Extract the encoding name
            ' Must convert to lowercase because we are highly likely
            ' to get UTF, Utf, utf, ISO, iso, et al.
            EncodingName = EncodedText.ToString.ToLower.Substring(0, iStart)
            ' Now remove the encoding name, plus the second and third markers
            If EncodedText.ToString.ToLower.Substring(EncodingName.Length, 3) _
                                                = "?q?" Then
                IsCodePointText = True
            End If
            EncodedText.Remove(0, iStart + 3)
            ' Remove the closing mark
            EncodedText.Remove(EncodedText.Length - 2, 2)

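            If Not IsCodePointText Then
                ' Base64 text: decode to bytes, then interpret the bytes in the
                ' named character set. Underscores are literal here; the
                ' underscore-for-space convention applies only to "?Q?" text
                DecodedText = Encoding.GetEncoding( _
                                    EncodingName.ToLower).GetString( _
                                    Convert.FromBase64String(EncodedText.ToString))
                Return DecodedText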
            Else
                ' Locate the code points
                Dim Filter As New Regex("=[\da-fA-F]{2}")
                Dim m As Match = Filter.Match(EncodedText.ToString)
                Dim WideChar As Char
                While m.Success
                    If m.ToString <> _empty Then
                        WideChar = ChrW("&H" & m.ToString.Substring(1, 2))
                        EncodedText.Replace(m.ToString, WideChar.ToString)
                    End If
                    m = m.NextMatch
                End While
                EncodedText = EncodedText.Replace(_underscore, _space)
                Return EncodeToWesternEuropean(EncodedText.ToString, _
                                               EncodingName)
            End If ' Not IsCodePointText Then
        End If ' iStart <> -1 Then
    End If ' iStart = 0 Then

    ' We can't convert this string. Return the input unchanged
    Return sEncodedText

End Function

Private Function EncodeToWesternEuropean(ByVal TextToEncode As String, _
                                         ByVal EncodingName As String) As String

    Dim SourceEncoding As Encoding = Encoding.GetEncoding(EncodingName.ToLower)
    Dim TargetEncoding As Encoding = Encoding.GetEncoding("iso-8859-1")

    Return SourceEncoding.GetString(TargetEncoding.GetBytes(TextToEncode))

End Function
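
The code-point branch above works because ISO-8859-1 maps each character from 0 to 255 to the byte with the same value. A more direct way to picture it is to collect the bytes as they are read and decode them once at the end. The sketch below does that. It is a simplified alternative for illustration, not the sample project's code: DecodeQPayload is a hypothetical helper, and it assumes the payload (the text between "?Q?" and "?=") contains only ASCII characters.

```vb
Imports System
Imports System.Collections.Generic
Imports System.Globalization
Imports System.Text

Module QDecodeSketch

    ' Hypothetical helper: decodes the payload of a "?Q?" section
    Public Function DecodeQPayload(ByVal Payload As String, _
                                   ByVal CharsetName As String) As String
        Dim Bytes As New List(Of Byte)
        Dim i As Integer = 0

        While i < Payload.Length
            Dim c As Char = Payload(i)
            Dim Value As Byte

            If c = "="c AndAlso i + 2 < Payload.Length AndAlso _
               Byte.TryParse(Payload.Substring(i + 1, 2), _
                             NumberStyles.AllowHexSpecifier, _
                             CultureInfo.InvariantCulture, Value) Then
                ' "=XX" stands for the single byte with hexadecimal value XX
                Bytes.Add(Value)
                i += 3
            ElseIf c = "_"c Then
                ' An underscore stands for a space
                Bytes.Add(CByte(32))
                i += 1
            Else
                ' Any other character is taken literally (assumed to be ASCII;
                ' a character above U+00FF would make CByte throw)
                Bytes.Add(CByte(AscW(c)))
                i += 1
            End If
        End While

        ' Interpret the collected bytes in the named character set
        Return Encoding.GetEncoding(CharsetName).GetString(Bytes.ToArray())
    End Function

    Sub Main()
        ' Non-ASCII results may need a Unicode-capable console to display
        Console.OutputEncoding = Encoding.UTF8
        Console.WriteLine(DecodeQPayload("_h=C3=B3nap_bejegyz=C3=A9seit", "utf-8"))
    End Sub

End Module
```

Because the bytes are only decoded after the whole payload has been read, a character that spans two bytes, such as =C3=B3, is handled in one step.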

Rebuild the Text

The third and final step in the process is to replace each token in the template that was generated with the decoded/re-encoded text:

Public Function DecodeText(ByVal EncodedText As String) As String

    Dim EncodingName As String = Nothing
    Dim ParameterText As String = Nothing
    Dim WorkingText As String = Nothing
    Dim iIndex As Integer = 0
    Dim CodeTokens As List(Of String) = BreakApartCodedText(EncodedText, _
                                                            ParameterText)

    If CodeTokens Is Nothing Then
        ' No complete set of markers was found, so there is nothing to decode
        Return EncodedText
    End If

    For Each CodeToken As String In CodeTokens
        WorkingText = ParameterText.Replace(_braceleft & iIndex.ToString & _
                                            _braceright, _
                                            DecodeToUnicode(CodeToken, _
                                                            EncodingName))
        ParameterText = WorkingText
        iIndex += 1
    Next

    Return WorkingText

End Function

The code samples above are part of a class. Download the sample project below to see the code working:

Download Sample Project

Here are some encoded strings you can paste into the text field in the sample project. The download file contains a number of different language encodings that you can test, including Cyrillic, Arabic, Japanese, Hebrew, Korean, Ukrainian, Chinese, Turkish, Thai, Greek, Pakistani and Hungarian. You will find them in comments inside the cHeaderDecoder class:

=?UTF-8?B?VmlzdGEg0ZYg0YPQutGA0LDRl9C90YHRjNC60LAg0LzQvtCy0LA=?=

=?GB2312?B?0MLPyruwzOKjrLvwsazNvMaso6zIpM7FxubA7aOsztLHwA==?==?GB2312?B?z8qjoaOh?=

=?Utf-8?B?zpHPgM6/z4PPhM6/zrvOriDOvM63zr3Phc68zqzPhM+Jzr0gzrzOtSDOny7OlS4=?=
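
The Base64 samples above can also be checked in isolation. The sketch below decodes only the payload, that is, the text between "?B?" and the closing "?=". DecodeBPayload is a hypothetical helper written for this illustration and is not part of the sample project. It assumes the character set name is one that Encoding.GetEncoding recognises on the machine running it.

```vb
Imports System
Imports System.Text

Module Base64Sketch

    ' Hypothetical helper: decodes the payload of a "?B?" section
    Public Function DecodeBPayload(ByVal Payload As String, _
                                   ByVal CharsetName As String) As String
        ' Base64 text becomes raw bytes first
        Dim RawBytes As Byte() = Convert.FromBase64String(Payload)
        ' The bytes are then interpreted in the named character set
        Return Encoding.GetEncoding(CharsetName).GetString(RawBytes)
    End Function

    Sub Main()
        ' Payload taken from the "=?Utf-8?B?VGFyaXF1ZQ==?=" example
        Console.WriteLine(DecodeBPayload("VGFyaXF1ZQ==", "utf-8"))   ' Tarique

        ' Non-ASCII results may need a Unicode-capable console to display
        Console.OutputEncoding = Encoding.UTF8
        Console.WriteLine(DecodeBPayload( _
            "VmlzdGEg0ZYg0YPQutGA0LDRl9C90YHRjNC60LAg0LzQvtCy0LA=", "utf-8"))
    End Sub

End Module
```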

All code examples on this site have been developed for .NET Framework 3.5