author | jshin@chromium.org <jshin@chromium.org@0039d316-1c4b-4281-b951-d872f2087c98> | 2009-06-24 16:44:49 +0000 |
---|---|---|
committer | jshin@chromium.org <jshin@chromium.org@0039d316-1c4b-4281-b951-d872f2087c98> | 2009-06-24 16:44:49 +0000 |
commit | 8df44a01ec210a3e0c04191fb34b392727017a2c (patch) | |
tree | c0cabca440e09bc579955a9338219ffc27309e50 /base/string_util.h | |
parent | 9f9c5296b022dc280cd38ff418f5177cf71856d6 (diff) | |
Pass through non-character codepoints in UTF-8,16,32 and Wide conversion functions.
They're structurally valid code points unlike malformed byte/surrogate sequences. I believe it's better to leave them
alone in conversion functions.
This CL was triggered by file_util_unittest failure on Linux/Mac with my upcoming change
to file_util::ReplaceIllegalCharacters (a part of http://codereview.chromium.org/126223 ).
In addition, the upper bound for the output length in CodepageToWide was tightened.
TEST=pass string_util and file_util unittests
BUG=NONE
Review URL: http://codereview.chromium.org/147038
git-svn-id: svn://svn.chromium.org/chrome/trunk/src@19132 0039d316-1c4b-4281-b951-d872f2087c98
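The distinction the commit message draws between "structurally valid" code points and non-characters can be sketched as two independent predicates. This is a minimal illustration, not code from this CL: the names `IsScalarValue` and `IsNoncharacter` are hypothetical, and the ranges are taken from the Unicode Standard's definitions of surrogates and noncharacters.

```cpp
#include <cstdint>

// Hypothetical helpers for illustration only; not part of base/string_util.h.

// A code point is structurally valid if it is a Unicode scalar value:
// at most U+10FFFF and outside the surrogate range U+D800..U+DFFF.
inline bool IsScalarValue(uint32_t cp) {
  return cp < 0xD800u || (cp >= 0xE000u && cp <= 0x10FFFFu);
}

// Noncharacters are U+FDD0..U+FDEF plus the last two code points of every
// plane (U+FFFE, U+FFFF, U+1FFFE, U+1FFFF, ..., U+10FFFE, U+10FFFF).
inline bool IsNoncharacter(uint32_t cp) {
  return (cp >= 0xFDD0u && cp <= 0xFDEFu) ||
         (cp <= 0x10FFFFu && (cp & 0xFFFEu) == 0xFFFEu);
}
```

Under this sketch, U+FFFE satisfies both predicates: it is a scalar value and also a noncharacter, which is why a conversion function that checks only structure would pass it through, while a lone surrogate such as U+D800 fails the first predicate.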
Diffstat (limited to 'base/string_util.h')
-rw-r--r-- | base/string_util.h | 14 |
1 files changed, 14 insertions, 0 deletions
diff --git a/base/string_util.h b/base/string_util.h
index d17e7d7..9a033b4 100644
--- a/base/string_util.h
+++ b/base/string_util.h
@@ -186,6 +186,13 @@ string16 ASCIIToUTF16(const StringPiece& ascii);
 // do the best it can and put the result in the output buffer. The versions that
 // return strings ignore this error and just return the best conversion
 // possible.
+//
+// Note that only the structural validity is checked and non-character
+// codepoints and unassigned are regarded as valid.
+// TODO(jungshik): Consider replacing an invalid input sequence with
+// the Unicode replacement character or adding |replacement_char| parameter.
+// Currently, it's skipped in the ouput, which could be problematic in
+// some situations.
 bool WideToUTF8(const wchar_t* src, size_t src_len, std::string* output);
 std::string WideToUTF8(const std::wstring& wide);
 bool UTF8ToWide(const char* src, size_t src_len, std::wstring* output);
@@ -250,6 +257,13 @@ bool WideToLatin1(const std::wstring& wide, std::string* latin1);
 // string be 8-bit or UTF8? It contains only characters that are < 256 (in the
 // first case) or characters that use only 8-bits and whose 8-bit
 // representation looks like a UTF-8 string (the second case).
+//
+// Note that IsStringUTF8 checks not only if the input is structrually
+// valid but also if it doesn't contain any non-character codepoint
+// (e.g. U+FFFE). It's done on purpose because all the existing callers want
+// to have the maximum 'discriminating' power from other encodings. If
+// there's a use case for just checking the structural validity, we have to
+// add a new function for that.
 bool IsString8Bit(const std::wstring& str);
 bool IsStringUTF8(const std::string& str);
 bool IsStringWideUTF8(const std::wstring& str);
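The two levels of checking described in the added comments (structure only, versus structure plus rejection of non-character code points) can be sketched as follows. This is an illustrative stand-alone sketch, not Chromium's implementation: `DecodeOneSketch`, `IsStructurallyValidUTF8Sketch`, and `IsStrictUTF8Sketch` are hypothetical names, and the decoder is hand-written here rather than taken from the library the real code relies on.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper: decodes one UTF-8 sequence starting at s[*i].
// On success, stores the code point in *cp, advances *i, and returns true.
// Rejects truncated, overlong, surrogate, and out-of-range sequences.
bool DecodeOneSketch(const std::string& s, size_t* i, uint32_t* cp) {
  const unsigned char b0 = static_cast<unsigned char>(s[*i]);
  size_t len = 0;
  uint32_t min = 0;
  if (b0 < 0x80) { *cp = b0; ++*i; return true; }
  if ((b0 & 0xE0) == 0xC0) { len = 2; min = 0x80; *cp = b0 & 0x1F; }
  else if ((b0 & 0xF0) == 0xE0) { len = 3; min = 0x800; *cp = b0 & 0x0F; }
  else if ((b0 & 0xF8) == 0xF0) { len = 4; min = 0x10000; *cp = b0 & 0x07; }
  else return false;  // stray continuation byte or invalid lead byte
  if (s.size() - *i < len) return false;  // truncated sequence
  for (size_t k = 1; k < len; ++k) {
    const unsigned char b = static_cast<unsigned char>(s[*i + k]);
    if ((b & 0xC0) != 0x80) return false;  // not a continuation byte
    *cp = (*cp << 6) | (b & 0x3F);
  }
  if (*cp < min) return false;                        // overlong encoding
  if (*cp > 0x10FFFF) return false;                   // beyond Unicode range
  if (*cp >= 0xD800 && *cp <= 0xDFFF) return false;   // encoded surrogate
  *i += len;
  return true;
}

// Noncharacters: U+FDD0..U+FDEF and the last two code points of each plane.
bool IsNoncharacterSketch(uint32_t cp) {
  return (cp >= 0xFDD0 && cp <= 0xFDEF) || (cp & 0xFFFE) == 0xFFFE;
}

// Structure-only check: noncharacters and unassigned code points pass.
bool IsStructurallyValidUTF8Sketch(const std::string& s) {
  size_t i = 0;
  uint32_t cp = 0;
  while (i < s.size()) {
    if (!DecodeOneSketch(s, &i, &cp)) return false;
  }
  return true;
}

// Stricter check in the spirit of the IsStringUTF8 comment: also rejects
// noncharacters such as U+FFFE.
bool IsStrictUTF8Sketch(const std::string& s) {
  size_t i = 0;
  uint32_t cp = 0;
  while (i < s.size()) {
    if (!DecodeOneSketch(s, &i, &cp)) return false;
    if (IsNoncharacterSketch(cp)) return false;
  }
  return true;
}
```

For example, the byte sequence `EF BF BE` (U+FFFE) passes the structure-only check but fails the strict one, whereas an overlong sequence such as `C0 80` fails both.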