How to Decode and Correctly Display UTF-8 and ISO Encoded Text
If you have ever worked with email or NNTP (news) messages you'll know that From: and Subject: headers don't always contain human-readable information, for example:
=?ISO-8859-9?Q?Yeni_Tak=FDm_Arkada=FElar=FD_Ar=FDyoruz?=
=?Utf-8?B?VGFyaXF1ZQ==?= =?Utf-8?B?2KfYsdiv2Yg=?=
=?koi8-r?B?+sHLwdog19DV08vOz8fPIMvPzMzFy9TP0sEgzsEg0MXWzyAzMDc=?=
The gibberish above is encoded for transmission over systems that can only deal with US-ASCII characters. What is needed is a conversion of the coded text into Unicode so that it can be displayed in its original form, like this:
Yeni Takım Arkadaşları Arıyoruz
Tariqueاردو
Заказ впускного коллектора на пежо 307
Disclaimer: The coded text used in this article was taken from various email messages that may or may not have been spam. kadaitcha.cx is not responsible for any rude bits because the author can't read or write in anything other than English.
Walkthrough
Break Apart
There are several problems to solve when trying to decode UTF/ISO encoded text. The first problem is determining what to decode. Often string sequences contain multiple sets of encodings, which are signified by multiple copies of "?= =?" delimiters in the string. You cannot simply pass the coded text into a decoder and have human-readable text pop out. Each set of text needs to be stripped out of its delimiters before decoding.
To make matters worse, the text may be Base64 encoded (marked ?B?), or it may contain individual code points (marked ?Q?), each written as an equals sign followed by a two-digit hexadecimal value. Strictly, each such pair is one byte in the named character set, so a single character may need more than one pair. For example, the continuous line of encoded text below contains three sets of UTF-8 delimiters and several code points, such as =C3=B6 (ö):
=?Utf-8?Q?RE:_Hogyan_lehet_az_elm=C3=BAlt_pl._3?==?Utf-8?Q?_h=C3=B3nap_bejegyz=C3=A9seit_kit=C3=B6r=C3=B6lni_Ou?==?Utf-8?Q?tlookb?=
The above text needs to be broken down into sets of encoded text, like this:
=?Utf-8?Q?RE:_Hogyan_lehet_az_elm=C3=BAlt_pl._3?=
=?Utf-8?Q?_h=C3=B3nap_bejegyz=C3=A9seit_kit=C3=B6r=C3=B6lni_Ou?=
=?Utf-8?Q?tlookb?=
The code below breaks the input text into individual sets of encodings as shown above; it also builds a template that is used to reconstruct the decoded text in the correct order and format. The template for the example above is "{0}{1}{2}". In most cases the template is superfluous, but a header can also mix plain ASCII text with encoded sets, and the template records where that plain text belongs.
Imports System.Collections.Generic
Imports System.Text
Imports System.Text.RegularExpressions

Public Class cHeaderDecoder

    Private Const _empty As String = "" ' A blank string, Char type won't work.
    Private Const _braceleft As Char = "{"c
    Private Const _braceright As Char = "}"c
    Private Const _question As Char = "?"c
    Private Const _underscore As Char = "_"c
    Private Const _space As Char = " "c
    Private Const L_MARK_LEFT As String = "=?"
    Private Const L_MARK_RIGHT As String = "?="
    Private Const L_MARK_CODE As String = "?q?="

    Private Function BreakApartCodedText(ByVal TokenisedText As String, _
                                         ByRef ParameterText As String) _
                                         As List(Of String)

        If (Not TokenisedText.Contains(L_MARK_LEFT)) OrElse _
           (Not TokenisedText.Contains(L_MARK_RIGHT)) Then
            Return Nothing
        End If

        Dim iLeftTokenStartCut As Integer
        Dim iRightTokenStartCut As Integer
        Dim CutLength As Integer
        Dim WorkingText As New StringBuilder
        Dim ExtractedToken As String
        Dim ExtractedTokens As New List(Of String)
        Dim iZ As Integer

        ' Some email clients insert a superfluous space, remove it:
        WorkingText.Append(TokenisedText.Replace("?= =?", "?==?"))

        Do While True
            Dim Current As String = WorkingText.ToString

            ' Find the starting token on the left
            iLeftTokenStartCut = Current.IndexOf(L_MARK_LEFT)
            If iLeftTokenStartCut = -1 Then Exit Do

            ' Find the first candidate for the closing mark after the
            ' starting token
            iRightTokenStartCut = Current.IndexOf(L_MARK_RIGHT, _
                iLeftTokenStartCut + L_MARK_LEFT.Length)

            ' If the text is made up of code points rather than being base64
            ' encoded, and the very first character of the text is a code
            ' point, the candidate is really the tail of "?q?=" and not
            ' L_MARK_RIGHT. Only the current set is checked, so a "?q?=" in
            ' a later set cannot cause several sets to be cut as one:
            If iRightTokenStartCut <> -1 AndAlso _
               String.Compare(Current, iRightTokenStartCut - 2, L_MARK_CODE, 0, _
                              L_MARK_CODE.Length, _
                              StringComparison.OrdinalIgnoreCase) = 0 Then
                ' We must locate the actual L_MARK_RIGHT
                iRightTokenStartCut = Current.IndexOf(L_MARK_RIGHT, _
                    iRightTokenStartCut + L_MARK_RIGHT.Length)
            End If

            ' No closing mark for this set: leave the remaining text as it is
            If iRightTokenStartCut = -1 Then Exit Do

            ' Decide how much to cut, closing mark included
            CutLength = (iRightTokenStartCut + L_MARK_RIGHT.Length) - _
                        iLeftTokenStartCut

            ' Extract the tokenised text
            ExtractedToken = Current.Substring(iLeftTokenStartCut, CutLength)

            ' Add the extracted text to our output list
            ExtractedTokens.Add(ExtractedToken)

            ' Remove the text from our working copy so that we don't
            ' get stuck in an infinite loop:
            WorkingText.Remove(iLeftTokenStartCut, CutLength)

            ' Insert a mask for the template, eg {0}, to take the place
            ' of the extracted text
            WorkingText.Insert(iLeftTokenStartCut, _braceleft & iZ.ToString & _
                                                   _braceright)
            iZ += 1
        Loop

        ' Nothing could be extracted, so there is nothing to decode
        If ExtractedTokens.Count = 0 Then
            Return Nothing
        End If

        ' Return the template and the extracted tokens
        ParameterText = WorkingText.ToString
        Return ExtractedTokens
    End Function

End Class
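As a point of comparison with the index arithmetic above, the same split can be sketched with a regular expression. This is not the code from the sample project: the module name SplitSketch, the function SplitEncodedWords and the pattern itself are assumptions made for illustration. The pattern relies on the text between the last two question marks never containing a question mark or a space, which holds for well-formed headers but may not hold for every message found in the wild.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Text.RegularExpressions

Module SplitSketch

    ' Assumed pattern: "=?" charset "?" B-or-Q "?" text "?=".
    ' The text part may not contain "?", so a text part that starts
    ' with "=" (as in "?Q?=C3") cannot be mistaken for the closing mark.
    Private ReadOnly EncodedWord As New Regex("=\?[^?\s]+\?[bBqQ]\?[^?\s]*\?=")

    ' Returns the encoded sets in the order in which they appear.
    Function SplitEncodedWords(ByVal header As String) As List(Of String)
        Dim tokens As New List(Of String)
        For Each m As Match In EncodedWord.Matches(header)
            tokens.Add(m.Value)
        Next
        Return tokens
    End Function

    Sub Main()
        Dim header As String = _
            "=?Utf-8?Q?RE:_Hogyan_lehet_az_elm=C3=BAlt_pl._3?=" & _
            "=?Utf-8?Q?_h=C3=B3nap_bejegyz=C3=A9seit_kit=C3=B6r=C3=B6lni_Ou?=" & _
            "=?Utf-8?Q?tlookb?="

        ' Writes one encoded set per line
        For Each token As String In SplitEncodedWords(header)
            Console.WriteLine(token)
        Next
    End Sub

End Module
```

For the Hungarian example this should yield the same three sets listed earlier. The sketch only extracts the sets; it does not build the template.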
Decode the Text
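The next phase of the process is to decode the text into Unicode. Base64 text is converted to bytes, which are then read in the character set named in the header. If the text contains individual code points, each =XX pair is first turned into a character with that value, and the result is passed through Western European (ISO-8859-1) to recover the original bytes before they are read in the named character set: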
Private Function DecodeToUnicode(ByVal sEncodedText As String, _
                                 ByRef EncodingName As String) As String

    Dim EncodedText As New StringBuilder
    Dim iStart As Integer ' Index
    Dim DecodedText As String
    Dim IsCodePointText As Boolean

    If sEncodedText.IndexOf(L_MARK_LEFT) = -1 Then
        ' Nothing to do
        Return sEncodedText
    End If

    EncodedText.Append(sEncodedText)

    ' Extract the preamble
    iStart = EncodedText.ToString.IndexOf(L_MARK_LEFT)

    If iStart = 0 Then
        ' Remove the first encoding name marker
        EncodedText.Remove(iStart, 2)
        iStart = EncodedText.ToString.IndexOf(_question)

        If iStart <> -1 Then
            ' Extract the encoding name
            ' Must convert to lowercase because we are highly likely
            ' to get UTF, Utf, utf, ISO, iso, et al. The invariant culture
            ' is used so that "ISO" does not become "ıso" on a Turkish system.
            EncodingName = EncodedText.ToString.ToLowerInvariant.Substring(0, iStart)

            ' Now remove the encoding name, plus the second and third markers
            If EncodedText.ToString.ToLowerInvariant.Substring(EncodingName.Length, 3) _
               = "?q?" Then
                IsCodePointText = True
            End If
            EncodedText.Remove(0, iStart + 3)

            ' Remove the closing mark
            EncodedText.Remove(EncodedText.Length - 2, 2)

            If Not IsCodePointText Then
                ' Base64: an underscore in the decoded text is a real
                ' underscore, so it is left alone
                DecodedText = Encoding.GetEncoding(EncodingName).GetString( _
                    Convert.FromBase64String(EncodedText.ToString))
                Return DecodedText
            Else
                ' In code point text an underscore stands for a space. Replace
                ' it first so that an encoded underscore (=5F) survives:
                EncodedText.Replace(_underscore, _space)

                ' Locate the code points and turn each =XX pair into the
                ' character with that value, in a single pass
                Dim Filter As New Regex("=[0-9a-fA-F]{2}")
                Dim Converted As String = Filter.Replace(EncodedText.ToString, _
                    Function(m As Match) ChrW(Convert.ToInt32(m.Value.Substring(1, 2), 16)).ToString())

                Return EncodeToWesternEuropean(Converted, EncodingName)
            End If ' Not IsCodePointText Then
        End If ' iStart <> -1 Then
    End If ' iStart = 0 Then

    ' We can't convert this string. Return the input
    Return sEncodedText
End Function
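Private Function EncodeToWesternEuropean(ByVal TextToEncode As String, _
                                         ByVal EncodingName As String) As String
    ' TextToEncode holds one character per original byte (values 0 to 255).
    ' ISO-8859-1 maps those characters straight back to the same byte
    ' values, and the bytes are then read in the encoding named in the header.
    Dim SourceEncoding As Encoding = Encoding.GetEncoding(EncodingName)
    Dim TargetEncoding As Encoding = Encoding.GetEncoding("iso-8859-1")
    Return SourceEncoding.GetString(TargetEncoding.GetBytes(TextToEncode))
End Function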
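The two decoding paths can also be sketched without the detour through ISO-8859-1, by collecting the bytes directly. The module PayloadSketch and the functions QToBytes and DecodePayload below are hypothetical names, not part of the sample project, and the sketch assumes the delimiters have already been stripped so that only the character set name, the B or Q flag and the coded text remain. It also assumes the .NET Framework, where encodings such as ISO-8859-9 are available by name; other runtimes may need code page encodings to be registered first.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Text

Module PayloadSketch

    ' Turns code point text into raw bytes: "_" is a space, "=XX" is one
    ' byte, and any other character is taken as its own low byte value.
    Function QToBytes(ByVal payload As String) As Byte()
        Dim bytes As New List(Of Byte)
        Dim i As Integer = 0
        While i < payload.Length
            Dim c As Char = payload(i)
            If c = "="c AndAlso i + 2 < payload.Length AndAlso _
               Uri.IsHexDigit(payload(i + 1)) AndAlso _
               Uri.IsHexDigit(payload(i + 2)) Then
                bytes.Add(Convert.ToByte(payload.Substring(i + 1, 2), 16))
                i += 3
            ElseIf c = "_"c Then
                bytes.Add(CByte(32)) ' space
                i += 1
            Else
                bytes.Add(CByte(AscW(c) And &HFF))
                i += 1
            End If
        End While
        Return bytes.ToArray()
    End Function

    ' Decodes one piece of coded text, given the character set name
    ' and the B/Q flag taken from the header.
    Function DecodePayload(ByVal charset As String, ByVal kind As Char, _
                           ByVal payload As String) As String
        Dim raw As Byte()
        If Char.ToUpperInvariant(kind) = "B"c Then
            raw = Convert.FromBase64String(payload)
        Else
            raw = QToBytes(payload)
        End If
        Return Encoding.GetEncoding(charset).GetString(raw)
    End Function

    Sub Main()
        ' Coded text taken from the examples at the top of the article
        Dim turkish As String = DecodePayload("ISO-8859-9", "Q"c, _
            "Yeni_Tak=FDm_Arkada=FElar=FD_Ar=FDyoruz")
        Dim urdu As String = DecodePayload("Utf-8", "B"c, "2KfYsdiv2Yg=")

        Console.OutputEncoding = Encoding.UTF8
        Console.WriteLine(turkish)
        Console.WriteLine(urdu)
    End Sub

End Module
```

Because the underscore is handled while the bytes are collected, an encoded underscore (=5F) stays an underscore, and the Base64 path never touches underscores at all.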
Rebuild the Text
The third and final step in the process is to replace each token in the template that was generated with the decoded/re-encoded text:
Public Function DecodeText(ByVal EncodedText As String) As String

    Dim EncodingName As String = Nothing
    Dim ParameterText As String = Nothing
    Dim WorkingText As String = Nothing
    Dim iIndex As Integer = 0
    Dim CodeTokens As List(Of String) = BreakApartCodedText(EncodedText, _
                                                            ParameterText)

    If CodeTokens Is Nothing Then
        ' No encoded sets were found, or the markers are broken in some
        ' way. Return the text unchanged
        Return EncodedText
    End If

    For Each CodeToken As String In CodeTokens
        ' Swap the mask, eg {0}, for the decoded text of the same index
        WorkingText = ParameterText.Replace(_braceleft & iIndex.ToString & _
                                            _braceright, _
                                            DecodeToUnicode(CodeToken, _
                                                            EncodingName))
        ParameterText = WorkingText
        iIndex += 1
    Next

    Return WorkingText
End Function
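One edge case with replacing the masks one after another is that decoded text which itself contains something like {1} will be scanned again on the next pass. A possible way around that, sketched here under the assumption that the template and the decoded strings are already to hand, is to fill every mask in a single pass. RebuildSketch, FillTemplate and Lookup are hypothetical names used only for this illustration.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Text.RegularExpressions

Module RebuildSketch

    ' Returns the decoded text for one {n} mask, or the mask itself
    ' when n does not refer to an entry in the list.
    Private Function Lookup(ByVal decoded As IList(Of String), _
                            ByVal m As Match) As String
        Dim index As Integer
        If Integer.TryParse(m.Groups(1).Value, index) AndAlso _
           index < decoded.Count Then
            Return decoded(index)
        End If
        Return m.Value
    End Function

    ' Replaces every {n} mask in one pass, so text that has already
    ' been substituted is never scanned again.
    Function FillTemplate(ByVal template As String, _
                          ByVal decoded As IList(Of String)) As String
        Return Regex.Replace(template, "\{([0-9]+)\}", _
                             Function(m As Match) Lookup(decoded, m))
    End Function

    Sub Main()
        Dim decoded As New List(Of String)
        decoded.Add("{1} first")
        decoded.Add("second")

        ' Prints: Re: {1} first / second
        Console.WriteLine(FillTemplate("Re: {0} / {1}", decoded))
    End Sub

End Module
```

This does not help when the plain text of the original header already contains a brace sequence such as {0}; that case would need a mask that cannot occur in a header.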
The code samples above are part of a class. Download the sample project below to see the code working:
Download Sample Project
Here are some encoded strings you can paste into the text field in the sample project. The download file contains a number of different language encodings that you can test, including Cyrillic, Arabic, Japanese, Hebrew, Korean, Ukrainian, Chinese, Turkish, Thai, Greek, Pakistani and Hungarian. You will find them in comments inside the cHeaderDecoder class:
=?UTF-8?B?VmlzdGEg0ZYg0YPQutGA0LDRl9C90YHRjNC60LAg0LzQvtCy0LA=?=
=?GB2312?B?0MLPyruwzOKjrLvwsazNvMaso6zIpM7FxubA7aOsztLHwA==?==?GB2312?B?z8qjoaOh?=
=?Utf-8?B?zpHPgM6/z4PPhM6/zrvOriDOvM63zr3Phc68zqzPhM+Jzr0gzrzOtSDOny7OlS4=?=
All code examples on this site have been developed for .NET Framework 3.5