Quantcast
Channel: VBForums - CodeBank - Visual Basic 6 and earlier
Viewing all articles
Browse latest Browse all 1544

String functions for VB6 to handle BSTR strings with surrogate-pairs

$
0
0
Ok, I just knocked this out today. I've now tested fairly well, including edge conditions, so I'm done unless someone finds a bug.

Others are also certainly willing to test and report any problems.

For changes/updates, see "Version" comments in the code.

It'll be curious if anyone actually uses this thing. But it sure creates some discussion when this UCS-2 vs UTF-16 issue comes up.

For a BAS module:
Code:


Option Explicit
'
'  Version 1.00    ' Posted on VBForums CodeBank.
'  Version 1.01    ' Fixed LeftEx edge condition.
'  Version 1.02    ' Fixed math problem in MidEx.
'  Version 1.03    ' Fixed iStart default in InstrEx.
'  Version 1.04    ' Fixed logic problem in InstrEx and InstrRevEx.
'
' Some explanation:
'
'  The UTF-16 character set encoding is made up of the following:
'      The UCS-2 characters, which are always 2-bytes.
'      Surrogate-pair characters, which are always 4-bytes.
'
'  If a character is a surrogate-pair:
'      The low-order-word  is always in the range of &HDC00 to &HDFFF.
'      The high-order-word is always in the range of &HD800 to &HDBFF.
'
'  To avoid any possible confusion, if a character is not a surrogate-pair,
'  it can't be anywhere in the range between &HD800 and &HDFFF.
'  That's part of the UTF-16 specifications.
'
'  The built-in VB6 functions always assume the characters are UCS-2
'  characters, i.e., 2-bytes long.  Therefore, we need a special set
'  of functions to deal with strings that may contain surrogate-pairs.
'
'  Just as a note, the above does provide an opportunity for "garbage"
'  to be in a string.  For instance, if a word in a string is in the
'  range of >=&HD800 And <=&HDBFF, but the next word isn't in the
'  range of >=&HDC00 And <=&HDFFF, this would be garbage.  In the same
'  vane, if a word is in the range of >=&HDC00 And <=&HDFFF, but the
'  prior word isn't in the range of >=&HD800 And <=&HDBFF, this would
'  be garbage as well.  And the following don't check for such garbage,
'  and may return inaccurate results if a string has such garbage.

'
'  Functions reworked:
'      AscWEx
'      ChrWEx
'      InStrEx
'      InStrRevEx
'      LeftEx
'      LenEx
'      MidEx
'      RightEx
'
'      Split and Join should work just fine as they are.
'
'  Extra "helper" functions (that can be used by anyone):
'      IsUcs2Char
'      IsLowSurrogate
'      IsHighSurrogate
'      IsSurrogatePair
'      HasSurrogatePair
'      SurrogatePairCount
'
Private Declare Function GetMem4 Lib "msvbvm60" (ByRef Source As Any, ByRef Dest As Any) As Long ' Always ignore the returned value, it's useless.
Private Declare Function GetMem2 Lib "msvbvm60" (ByRef Source As Any, ByRef Dest As Any) As Long ' Always ignore the returned value, it's useless.
'


Public Function AscWEx(sChar As String) As Long
    ' Returns double-word, so surrogate-pairs can be represented.
    ' Examines only the first character of the string.
    '
    If IsSurrogatePair(sChar) Then  ' Words are swapped to accomodate Little Endian (LE).
        GetMem2 ByVal StrPtr(sChar) + 2&, AscWEx
        GetMem2 ByVal StrPtr(sChar), ByVal VarPtr(AscWEx) + 2&
    Else
        GetMem2 ByVal StrPtr(sChar), AscWEx
    End If
End Function

Public Function ChrWEx(ByVal iChar As Long) As String
    ' Ok, let's check for a bit of garbage.
    If iChar >= &HDC000000 And iChar <= &HDFFFFFFF Then Err.Raise 5&
    ' Now, let's just decide if we're dealing with a surrogate-pair or not.
    If iChar >= &HD8000000 And iChar <= &HDBFFFFFF Then
        ChrWEx = "  "
        GetMem2 iChar, ByVal StrPtr(ChrWEx) + 2&                ' Still must deal with LE going back in.
        GetMem2 ByVal VarPtr(iChar) + 2&, ByVal StrPtr(ChrWEx)
    Else
        ' If it's not a surrogate pair, we're going to ignore the high word.
        ChrWEx = ChrW$(CInt(iChar And &HFFFF))
    End If
End Function

Public Function InStrEx(sHay As String, sNeedle As String, Optional ByVal iStart As Long = 1&) As Long
    ' The optional iStart is at the end, as VB6 doesn't provide the kind of overloading the InStr() function does.
    ' All the InStrEx searches as done as vbBinaryCompare (the Instr() default) as that's the only want that really makes sense when searching for surrogate-pairs.
    ' iStart counts surrogate-pair characters only once, to respect them as characters.
    '
    ' The only real issue with this one (as opposed to Instr) is correctly handling iStart.
    '
    If iStart < 1& Then Err.Raise 5&        ' Same way Instr() handles it.
    Dim iPreCnt As Long
    iPreCnt = SurrogatePairCount(LeftEx(sHay, iStart - 1&))
    iStart = iStart + iPreCnt
    InStrEx = InStr(iStart, sHay, sNeedle, vbBinaryCompare) - iPreCnt
End Function

Public Function InStrRevEx(sHay As String, sNeedle As String, Optional ByVal iStart As Long = -1&) As Long
    ' All the InStrEx searches as done as vbBinaryCompare (the InstrRev() default) as that's the only want that really makes sense when searching for surrogate-pairs.
    ' iStart counts surrogate-pair characters only once, to respect them as characters.
    '
    Dim iPairCount As Long
    iPairCount = SurrogatePairCount(sHay)                      ' We'll need this a couple of times.
    Dim iLenEx As Long
    iLenEx = Len(sHay) - iPairCount                            ' We'll need this a couple of times.
    '
    If iStart = -1& Then iStart = iLenEx
    If iStart < 1& Then Err.Raise 5&                            ' Same way InStrRev() handles it.
    If iLenEx < iStart Then Exit Function                      ' Same way InStrRev() does it, even though it doesn't really make sense.
    '
    iStart = iStart + SurrogatePairCount(LeftEx(sHay, iStart))  ' Make it a no-surrogate-pair version, so we can use it in InStrRev().
    InStrRevEx = InStrRev(sHay, sNeedle, iStart)                ' But we haven't correctly handled surrogate-pairs, yet.
    If InStrRevEx > 1& Then                                    ' If it's one, we're good to go, either way.
        InStrRevEx = InStrRevEx - SurrogatePairCount(Left$(sHay, InStrRevEx - 1&))
    End If                                                      ' Above, adjust for surrogate pairs prior to our "find".
End Function

Public Function LeftEx(sStr As String, ByVal iLength As Long) As String
    ' We assume that iLength is characters, including surrogate-pairs (counted once each).
    '
    If iLength = 0& Then Exit Function                          ' Easy.
    If iLength < 0& Then Err.Raise 5&                          ' Same way Left$() handles it.
    '
    LeftEx = Left$(sStr, iLength + SurrogatePairCount(sStr))    ' Start by assuming they're all in the piece we want.
    Do                                                          ' Loop until we've trimmed to correct length.
        If LenEx(LeftEx) <= iLength Then Exit Function          ' Return when we've got the correct length (or asked for more than there are).
        '                                                      ' This test works even if there aren't surrogate-pairs in the tested piece.
        If IsLowSurrogate(Right$(LeftEx, 1&)) Then              ' Is the right-most word a high of a surrogate-pair?
            LeftEx = Left$(LeftEx, Len(LeftEx) - 2&)            ' Trim surrogate-pair.
        Else
            LeftEx = Left$(LeftEx, Len(LeftEx) - 1&)            ' Trim UCS-2 character.
        End If
    Loop
End Function

Public Function LenEx(sStr As String) As Long
    LenEx = Len(sStr) - SurrogatePairCount(sStr)
End Function

Public Function MidEx(sStr As String, ByVal iStart As Long, Optional ByVal iLength As Long) As String
    ' We assume that iStart and iLength is characters, including surrogate-pairs (counted once each).
    '
    If iStart < 1& Then Err.Raise 5&                            ' Same way Mid$() handles it.
    iStart = iStart - 1&                                        ' Make iStart 0 based, it's just easier.
    If iLength < 0& Then Err.Raise 5&                          ' Same way Mid$() handles it.
    If iLength = 0& Then iLength = &H7FFFFFFF                  ' Just makes it easy.  We want all that's remaining.
    Dim iLenEx As Long
    iLenEx = LenEx(sStr)
    If iLength > iLenEx - iStart Then                          ' Adjust length to be exactly what we want.
        iLength = iLenEx - iStart
    End If
    If iLength <= 0& Then Exit Function                        ' Return empty string, same way Mid$() does it.
    MidEx = LeftEx(RightEx(sStr, iLenEx - iStart), iLength)
End Function

Public Function RightEx(sStr As String, ByVal iLength As Long) As String
    ' We assume that iLength is characters, including surrogate-pairs (counted once each).
    '
    If iLength = 0& Then Exit Function                          ' Easy.
    If iLength < 0& Then Err.Raise 5&                          ' Same way Right$() handles it.
    '
    RightEx = Right$(sStr, iLength + SurrogatePairCount(sStr))  ' Start by assuming they're all in the piece we want.
    Do                                                          ' Loop until we've trimmed to correct length.
        If LenEx(RightEx) <= iLength Then Exit Function        ' Return when we've got the correct length (or asked for more than there are).
        '                                                      ' This test works even if there aren't surrogate-pairs in the tested piece.
        If IsSurrogatePair(RightEx) Then                        ' Is the left-most character a surrogate-pair?
            RightEx = Right$(RightEx, Len(RightEx) - 2&)        ' Trim surrogate-pair.
        Else
            RightEx = Right$(RightEx, Len(RightEx) - 1&)        ' Trim UCS-2 character.
        End If
    Loop
End Function



Public Function IsUcs2Char(sChar As String) As Boolean
    ' Only tests the first character of sChar.
    ' Just say "Not IsUcs2Char" to see if a word of a surrogate-pair.
    If Len(sChar) = 0& Then Exit Function
    Dim i As Integer:  i = AscW(sChar)
    IsUcs2Char = i < &HD800 Or i > &HDFFF
End Function

Public Function IsLowSurrogate(sChar As String) As Boolean
    ' Only tests the first character of sChar.
    If Len(sChar) = 0& Then Exit Function
    Dim i As Integer:  i = AscW(sChar)
    IsLowSurrogate = i >= &HDC00 And i <= &HDFFF
End Function

Public Function IsHighSurrogate(sChar As String) As Boolean
    ' Only tests the first character of sChar.
    If Len(sChar) = 0& Then Exit Function
    Dim i As Integer:  i = AscW(sChar)
    IsHighSurrogate = i >= &HD800 And i <= &HDBFF
End Function

Public Function IsSurrogatePair(sChar As String) As Boolean
    ' Looks precisely at the first FOUR bytes in sChar.
    If Len(sChar) = 0& Then Exit Function
    Static i(1&) As Integer
    GetMem4 ByVal StrPtr(sChar), i(0&)  ' This is safe because, if we're at the last character, we'll just get the null terminator.
    IsSurrogatePair = i(1&) >= &HDC00 And i(1&) <= &HDFFF And i(0&) >= &HD800 And i(0&) <= &HDBFF
End Function

Public Function HasSurrogatePair(sStr As String) As Boolean
    Dim bb() As Byte:  bb = sStr
    Dim i As Long
    For i = 1& To UBound(bb) Step 2&    ' We're looking only at the high-bytes.
        If bb(i) >= &HD8 And bb(i) <= &HDB Then
            If i + 2& <= UBound(bb) Then
                HasSurrogatePair = bb(i + 2&) >= &HDC And bb(i + 2&) <= &HDF
            End If
            Exit Function      ' If we return False here, the string has garbage.
        End If
    Next
End Function

Public Function SurrogatePairCount(sStr As String) As Long
    Dim bb() As Byte:  bb = sStr
    Dim i As Long
    For i = 1& To UBound(bb) Step 2&    ' We're looking only at the high-bytes.
        If bb(i) >= &HD8 And bb(i) <= &HDB Then
            If i + 2& <= UBound(bb) Then
                If bb(i + 2&) >= &HDC And bb(i + 2&) <= &HDF Then
                    SurrogatePairCount = SurrogatePairCount + 1&
                    i = i + 2&
                End If
            End If
        End If
    Next
End Function


There's some testing code in the next post.

Enjoy,
Elroy

Viewing all articles
Browse latest Browse all 1544

Latest Images

Trending Articles



Latest Images

<script src="https://jsc.adskeeper.com/r/s/rssing.com.1596347.js" async> </script>