author | jshin <jshin@chromium.org> | 2016-03-18 11:42:39 -0700 |
---|---|---|
committer | Commit bot <commit-bot@chromium.org> | 2016-03-18 18:45:25 +0000 |
commit | 62a928390ba06db29576bbb32606696b3e16a66c (patch) | |
tree | 027a998922dd5972fec7866115b566ba7473d9f8 /url | |
parent | 840fc096207741050490badb7730f5a21b761fc4 (diff) | |
Implement a new IDN display policy
The new policy is language-independent, is implemented with
ICU's uspoof API, and is as follows:
1. Use moderately restrictive rules for script mixing [1] with additional
restrictions on mixing with Latin.
- Script mixing is only allowed between ASCII-Latin (instead
of any Latin) and another script allowed at the moderate
restriction level
2. Only allow the recommended sets from UTS 39 [2] and inclusion sets from
UAX 31 [3]. This is equivalent to [:IdentifierStatus=Allowed:] [4].
3. Allow 5 aspirational scripts from UAX 31 [5]
4. Do not allow labels with two or more numbering systems mixed.
5. Do not allow invisible characters or a sequence of the same
combining mark.
6. Turn off whole script confusable check. It'd block some common
domain labels like рф (IDN ccTLD for '.ru'),
'bücher' (German) and 'färgbolaget' (Swedish).
7. Keep the 'mixed script confusable' check ON. This is separate
from the 'script mixing restriction' and will catch cases like 'gօօgle'
with 'օ' (U+0585; Armenian Small Letter OH) [6] that would otherwise be
allowed by rules #1 through #5.
8. Block four Katakana characters surrounded by non-Japanese scripts because
they could be mistaken for a slash. (This has been in place for a few years and is kept.)
9. Labels with any of four deviation characters (IDNA 2003 vs IDNA 2008)
encoded in punycode/ACE are always shown in Punycode. This is to make
the display policy consistent with our prior decision to use UTS 46
'transitional' processing (which maps or drops the 4 deviation characters). [9]
10. The character blacklist (Mozilla's: [8]) is trimmed down to two characters.
Note that this is almost identical to Mozilla's IDN display algorithm
[7] except for #7, #8, and the additional restrictions in #1. #9 is another difference
because Mozilla uses UTS 46 'non-transitional' processing and we use UTS 46 'transitional' processing.
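As a rough illustration of rules #4 and #9 (this is not the code in this CL, which relies on ICU's uspoof API), the two checks could be sketched as follows. The function names and the deliberately short digit table are assumptions made for the example. The four deviation characters are the ones UTS 46 defines: U+00DF, U+03C2, U+200C and U+200D.

```cpp
#include <string>

// Hypothetical helper: true if the decoded label contains one of the four
// UTS 46 deviation characters (sharp s, final sigma, ZWNJ, ZWJ).
bool HasDeviationCharacter(const std::u32string& label) {
  for (char32_t c : label) {
    if (c == 0x00DF || c == 0x03C2 || c == 0x200C || c == 0x200D)
      return true;
  }
  return false;
}

// Hypothetical helper: returns the zero digit of the decimal numbering
// system that `c` belongs to, or 0 if `c` is not covered by this small table
// (ASCII, Arabic-Indic, Extended Arabic-Indic, Devanagari, Bengali, Thai).
char32_t DigitZero(char32_t c) {
  static const char32_t kZeros[] = {0x0030, 0x0660, 0x06F0,
                                    0x0966, 0x09E6, 0x0E50};
  for (char32_t zero : kZeros) {
    if (c >= zero && c <= zero + 9)
      return zero;
  }
  return 0;
}

// True if the label uses digits from two or more numbering systems.
bool HasMixedNumberingSystems(const std::u32string& label) {
  char32_t seen = 0;
  for (char32_t c : label) {
    char32_t zero = DigitZero(c);
    if (zero == 0)
      continue;  // Not a digit known to this sketch.
    if (seen != 0 && seen != zero)
      return true;
    seen = zero;
  }
  return false;
}
```

Under this sketch, a caller would display a label in Punycode whenever either function returns true for the decoded label.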
Most of the domains filtered out in the ".com" TLD are filtered because of the
character set restrictions (#2 and #3), which account for 94% (2,050)
of the IDNs filtered out (0.2% of ~1 million IDNs in the com TLD).
All the IDN TLDs are shown in Unicode. So are all the IDNs in the
effective TLD list, ".рф" (~ 860k), and ".みんな" (~25k).
48 out of 200k in ".xyz" and 3 out of 25k in ".jp" are filtered and shown
in punycode.
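The ".com" figures can be cross-checked with a little arithmetic. This is a sketch that assumes 2,050 is 94% of all filtered ".com" IDNs and that the TLD holds roughly one million IDNs. The helper names are made up for the example.

```cpp
// Estimated total, given that `count` items make up `share` (a fraction in
// (0, 1]) of that total.
double EstimateTotal(double count, double share) {
  return count / share;
}

// Percentage that `part` represents of `whole`.
double Percent(double part, double whole) {
  return 100.0 * part / whole;
}
```

With these assumptions, EstimateTotal(2050, 0.94) gives about 2,181 filtered labels, and Percent of that against 1,000,000 is about 0.22%, which agrees with the 0.2% quoted above.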
P.S. This CL keeps the 'languages' parameter for the public APIs. I'll follow up
on this CL with another to get rid of that parameter and adjust callers.
P.S.2: http://dev.chromium.org/developers/design-documents/idn-in-google-chrome will be updated after this CL is landed.
[1] http://www.unicode.org/reports/tr39/#Restriction_Level_Detection
[2] http://www.unicode.org/reports/tr39
http://www.unicode.org/Public/security/latest/xidmodifications.txt
[3] http://www.unicode.org/reports/tr31/#Table_Recommended_Scripts
[4] http://goo.gl/L3WD1s
[5] http://www.unicode.org/reports/tr31/#Aspirational_Use_Scripts
[6] http://unicode.org/cldr/utility/confusables.jsp?a=o&r=None
[7] https://wiki.mozilla.org/IDN_Display_Algorithm
[8] http://kb.mozillazine.org/Network.IDN.blacklist_chars : Most of them
are blocked or mapped anyway by other restrictions/mechanisms in place.
See https://bugzilla.mozilla.org/show_bug.cgi?id=1257108
[9] This is to "fix" bug 595263
BUG=336973,595263
TEST=components_unittests --gtest_filter=*IDN*, --gtest_filter=UrlForm*,
--gtest_filter=*Puny*
Review URL: https://codereview.chromium.org/1258813002
Cr-Commit-Position: refs/heads/master@{#382029}
Diffstat (limited to 'url')
-rw-r--r-- | url/url_canon_unittest.cc | 7 |
1 files changed, 7 insertions, 0 deletions
diff --git a/url/url_canon_unittest.cc b/url/url_canon_unittest.cc
index 82edd0e..f5fedfc 100644
--- a/url/url_canon_unittest.cc
+++ b/url/url_canon_unittest.cc
@@ -402,6 +402,13 @@ TEST(URLCanonTest, Host) {
       // (added in Unicode 4.1). UTS 46 table 4 row (k)
     {"bc\xc8\xba.com", L"bc\x23a.com", "xn--bc-is1a.com",
       Component(0, 15), CanonHostInfo::NEUTRAL, -1, ""},
+      // Maps U+FF43 (Full Width Small Letter C) to 'c'.
+    {"ab\xef\xbd\x83.xyz", L"ab\xff43.xyz", "abc.xyz",
+      Component(0, 7), CanonHostInfo::NEUTRAL, -1, ""},
+      // Maps U+1D68C (Math Monospace Small C) to 'c'.
+      // U+1D68C = \xD835\xDE8C in UTF-16
+    {"ab\xf0\x9d\x9a\x8c.xyz", L"ab\xd835\xde8c.xyz", "abc.xyz",
+      Component(0, 7), CanonHostInfo::NEUTRAL, -1, ""},
       // BiDi check test
       // "Divehi" in Divehi (Thaana script) ends with BidiClass=NSM.
       // Disallowed in IDNA 2003 but now allowed in UTS 46/IDNA 2008.
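
The byte sequences in the two added test cases can be checked by hand-encoding the code points. The following is a minimal sketch with hypothetical helper names (EncodeUtf8 and EncodeUtf16 are not part of the Chromium url library). It implements the standard UTF-8 and UTF-16 encodings for a single code point.

```cpp
#include <cassert>
#include <string>

// Encodes one Unicode scalar value as UTF-8.
std::string EncodeUtf8(char32_t cp) {
  // Precondition: a valid scalar value (no surrogates, at most U+10FFFF).
  assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));
  std::string out;
  if (cp < 0x80) {
    out += static_cast<char>(cp);
  } else if (cp < 0x800) {
    out += static_cast<char>(0xC0 | (cp >> 6));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  } else if (cp < 0x10000) {
    out += static_cast<char>(0xE0 | (cp >> 12));
    out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  } else {
    out += static_cast<char>(0xF0 | (cp >> 18));
    out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
    out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  }
  return out;
}

// Encodes one Unicode scalar value as UTF-16 (a surrogate pair above U+FFFF).
std::u16string EncodeUtf16(char32_t cp) {
  std::u16string out;
  if (cp < 0x10000) {
    out += static_cast<char16_t>(cp);
  } else {
    char32_t v = cp - 0x10000;
    out += static_cast<char16_t>(0xD800 | (v >> 10));    // High surrogate.
    out += static_cast<char16_t>(0xDC00 | (v & 0x3FF));  // Low surrogate.
  }
  return out;
}
```

EncodeUtf8(0xFF43) yields the bytes EF BD 83 and EncodeUtf8(0x1D68C) yields F0 9D 9A 8C, matching the narrow-string inputs in the test. EncodeUtf16(0x1D68C) yields the surrogate pair D835 DE8C noted in the test comment.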