author: jshin@chromium.org <jshin@chromium.org@0039d316-1c4b-4281-b951-d872f2087c98> 2009-06-24 16:44:49 +0000
committer: jshin@chromium.org <jshin@chromium.org@0039d316-1c4b-4281-b951-d872f2087c98> 2009-06-24 16:44:49 +0000
commit: 8df44a01ec210a3e0c04191fb34b392727017a2c
tree: c0cabca440e09bc579955a9338219ffc27309e50
parent: 9f9c5296b022dc280cd38ff418f5177cf71856d6
Pass through non-character codepoints in UTF-8,16,32 and Wide conversion functions.

They're structurally valid code points, unlike malformed byte/surrogate sequences. I believe it's better to leave them alone in conversion functions.

This CL was triggered by a file_util_unittest failure on Linux/Mac with my upcoming change to file_util::ReplaceIllegalCharacters (a part of http://codereview.chromium.org/126223 ).

In addition, the upper bound for the output length in CodepageToWide was tightened.

TEST=pass string_util and file_util unittests
BUG=NONE
Review URL: http://codereview.chromium.org/147038

git-svn-id: svn://svn.chromium.org/chrome/trunk/src@19132 0039d316-1c4b-4281-b951-d872f2087c98
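
To illustrate the distinction the commit message draws, here is a minimal sketch of a UTF-16 decoder that passes non-character code points through unchanged and skips malformed surrogate sequences. It is not the code from this CL: `Utf16ToCodePoints`, `IsHighSurrogate`, and `IsLowSurrogate` are hypothetical names, and `std::u16string` stands in for `string16`.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helpers for illustration; not part of base/string_util.h.
inline bool IsHighSurrogate(char16_t c) { return c >= 0xD800 && c <= 0xDBFF; }
inline bool IsLowSurrogate(char16_t c) { return c >= 0xDC00 && c <= 0xDFFF; }

// Decodes UTF-16 into code points.
// - Non-characters (e.g. U+FFFE) are structurally valid and pass through.
// - Unpaired surrogates are malformed: they are skipped in the output and
//   cause the function to return false.
bool Utf16ToCodePoints(const std::u16string& src,
                       std::vector<uint32_t>* out) {
  bool ok = true;
  out->clear();
  for (std::size_t i = 0; i < src.size(); ++i) {
    const char16_t c = src[i];
    if (IsHighSurrogate(c)) {
      if (i + 1 < src.size() && IsLowSurrogate(src[i + 1])) {
        // Combine a well-formed surrogate pair into one code point.
        const uint32_t hi = static_cast<uint32_t>(c) - 0xD800;
        const uint32_t lo = static_cast<uint32_t>(src[i + 1]) - 0xDC00;
        out->push_back(0x10000 + (hi << 10) + lo);
        ++i;  // Consume the low surrogate as well.
      } else {
        ok = false;  // Unpaired high surrogate: skip it.
      }
    } else if (IsLowSurrogate(c)) {
      ok = false;  // Low surrogate without a preceding high one: skip it.
    } else {
      out->push_back(c);  // BMP code point, including non-characters.
    }
  }
  return ok;
}
```

Under this sketch, an input containing U+FFFE converts successfully with U+FFFE present in the output, while an input containing a lone 0xD800 reports failure and drops that unit.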
Diffstat (limited to 'base/string_util.h')
-rw-r--r-- base/string_util.h 14
1 files changed, 14 insertions, 0 deletions
diff --git a/base/string_util.h b/base/string_util.h
index d17e7d7..9a033b4 100644
--- a/base/string_util.h
+++ b/base/string_util.h
@@ -186,6 +186,13 @@ string16 ASCIIToUTF16(const StringPiece& ascii);
// do the best it can and put the result in the output buffer. The versions that
// return strings ignore this error and just return the best conversion
// possible.
+//
+// Note that only structural validity is checked; non-character and
+// unassigned code points are regarded as valid.
+// TODO(jungshik): Consider replacing an invalid input sequence with
+// the Unicode replacement character or adding |replacement_char| parameter.
+// Currently, it's skipped in the output, which could be problematic in
+// some situations.
bool WideToUTF8(const wchar_t* src, size_t src_len, std::string* output);
std::string WideToUTF8(const std::wstring& wide);
bool UTF8ToWide(const char* src, size_t src_len, std::wstring* output);
@@ -250,6 +257,13 @@ bool WideToLatin1(const std::wstring& wide, std::string* latin1);
// string be 8-bit or UTF8? It contains only characters that are < 256 (in the
// first case) or characters that use only 8-bits and whose 8-bit
// representation looks like a UTF-8 string (the second case).
+//
+// Note that IsStringUTF8 checks not only if the input is structurally
+// valid but also if it doesn't contain any non-character codepoint
+// (e.g. U+FFFE). It's done on purpose because all the existing callers want
+// to have the maximum 'discriminating' power from other encodings. If
+// there's a use case for just checking the structural validity, we have to
+// add a new function for that.
bool IsString8Bit(const std::wstring& str);
bool IsStringUTF8(const std::string& str);
bool IsStringWideUTF8(const std::wstring& str);
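
The comment added in the second hunk separates two levels of UTF-8 checking: structural validity alone, and structural validity plus rejection of non-character code points. A minimal sketch of that difference follows. It is an illustration only, not Chromium's implementation of IsStringUTF8; `IsValidUtf8`, `Utf8Check`, and `IsNonCharacter` are hypothetical names.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Non-characters: U+FDD0..U+FDEF plus the last two code points of each plane
// (U+FFFE, U+FFFF, U+1FFFE, ..., U+10FFFF).
inline bool IsNonCharacter(uint32_t cp) {
  return (cp >= 0xFDD0 && cp <= 0xFDEF) || (cp & 0xFFFE) == 0xFFFE;
}

// Hypothetical strictness switch for this sketch.
enum class Utf8Check { kStructuralOnly, kRejectNonCharacters };

bool IsValidUtf8(const std::string& s, Utf8Check mode) {
  std::size_t i = 0;
  while (i < s.size()) {
    const unsigned char b0 = static_cast<unsigned char>(s[i]);
    uint32_t cp = 0;
    uint32_t min_cp = 0;  // Smallest code point allowed for this length.
    std::size_t len = 0;
    if (b0 < 0x80) { cp = b0; len = 1; min_cp = 0; }
    else if ((b0 & 0xE0) == 0xC0) { cp = b0 & 0x1F; len = 2; min_cp = 0x80; }
    else if ((b0 & 0xF0) == 0xE0) { cp = b0 & 0x0F; len = 3; min_cp = 0x800; }
    else if ((b0 & 0xF8) == 0xF0) { cp = b0 & 0x07; len = 4; min_cp = 0x10000; }
    else return false;  // Stray continuation byte or invalid lead byte.

    if (i + len > s.size()) return false;  // Truncated sequence.
    for (std::size_t k = 1; k < len; ++k) {
      const unsigned char b = static_cast<unsigned char>(s[i + k]);
      if ((b & 0xC0) != 0x80) return false;  // Not a continuation byte.
      cp = (cp << 6) | (b & 0x3F);
    }
    if (cp < min_cp) return false;                   // Overlong encoding.
    if (cp > 0x10FFFF) return false;                 // Beyond Unicode range.
    if (cp >= 0xD800 && cp <= 0xDFFF) return false;  // Encoded surrogate.

    // The stricter mode additionally rejects non-characters.
    if (mode == Utf8Check::kRejectNonCharacters && IsNonCharacter(cp))
      return false;
    i += len;
  }
  return true;
}
```

With this sketch, the bytes EF BF BE (U+FFFE) pass in `kStructuralOnly` mode and fail in `kRejectNonCharacters` mode, while an overlong sequence such as C0 80 fails in both.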