author | initial.commit <initial.commit@0039d316-1c4b-4281-b951-d872f2087c98> | 2008-07-26 22:42:52 +0000 |
---|---|---|
committer | initial.commit <initial.commit@0039d316-1c4b-4281-b951-d872f2087c98> | 2008-07-26 22:42:52 +0000 |
commit | 586acc5fe142f498261f52c66862fa417c3d52d2 (patch) | |
tree | c98b3417a883f2477029c8cd5888f4078681e24e /net/base/registry_controlled_domain.h | |
parent | a814a8d55429605fe6d7045045cd25b6bf624580 (diff) | |
Add net to the repository.
git-svn-id: svn://svn.chromium.org/chrome/trunk/src@14 0039d316-1c4b-4281-b951-d872f2087c98
Diffstat (limited to 'net/base/registry_controlled_domain.h')
-rw-r--r-- | net/base/registry_controlled_domain.h | 298 |
1 files changed, 298 insertions, 0 deletions
diff --git a/net/base/registry_controlled_domain.h b/net/base/registry_controlled_domain.h
new file mode 100644
index 0000000..6b5adbc
--- /dev/null
+++ b/net/base/registry_controlled_domain.h
@@ -0,0 +1,298 @@
+//* -*- Mode: C++; tab-width: 2; indent-tabs-mode: nil; c-basic-offset: 2 -*- */
+/* ***** BEGIN LICENSE BLOCK *****
+ * Version: MPL 1.1/GPL 2.0/LGPL 2.1
+ *
+ * The contents of this file are subject to the Mozilla Public License Version
+ * 1.1 (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ * http://www.mozilla.org/MPL/
+ *
+ * Software distributed under the License is distributed on an "AS IS" basis,
+ * WITHOUT WARRANTY OF ANY KIND, either express or implied. See the License
+ * for the specific language governing rights and limitations under the
+ * License.
+ *
+ * The Original Code is Mozilla TLD Service
+ *
+ * The Initial Developer of the Original Code is
+ * Google Inc.
+ * Portions created by the Initial Developer are Copyright (C) 2006
+ * the Initial Developer. All Rights Reserved.
+ *
+ * Contributor(s):
+ *   Pamela Greene <pamg.bugs@gmail.com> (original author)
+ *
+ * Alternatively, the contents of this file may be used under the terms of
+ * either the GNU General Public License Version 2 or later (the "GPL"), or
+ * the GNU Lesser General Public License Version 2.1 or later (the "LGPL"),
+ * in which case the provisions of the GPL or the LGPL are applicable instead
+ * of those above. If you wish to allow use of your version of this file only
+ * under the terms of either the GPL or the LGPL, and not to allow others to
+ * use your version of this file under the terms of the MPL, indicate your
+ * decision by deleting the provisions above and replace them with the notice
+ * and other provisions required by the GPL or the LGPL. If you do not delete
+ * the provisions above, a recipient may use your version of this file under
+ * the terms of any one of the MPL, the GPL or the LGPL.
+ *
+ * ***** END LICENSE BLOCK ***** */
+
+// NB: Modelled after Mozilla's code (originally written by Pamela Greene,
+// later modified by others), but almost entirely rewritten for Chrome.
+
+/*
+  (Documentation based on the Mozilla documentation currently at
+  http://wiki.mozilla.org/Gecko:Effective_TLD_Service, written by the same
+  author.)
+
+  The RegistryControlledDomainService examines the hostname of a GURL passed to
+  it and determines the longest portion that is controlled by a registrar.
+  Although technically the top-level domain (TLD) for a hostname is the last
+  dot-portion of the name (such as .com or .org), many domains (such as co.uk)
+  function as though they were TLDs, allocating any number of more specific,
+  essentially unrelated names beneath them. For example, .uk is a TLD, but
+  nobody is allowed to register a domain directly under .uk; the "effective"
+  TLDs are ac.uk, co.uk, and so on. We wouldn't want to allow any site in
+  *.co.uk to set a cookie for the entire co.uk domain, so it's important to be
+  able to identify which higher-level domains function as effective TLDs and
+  which can be registered.
+
+  The service obtains its information about effective TLDs from a text resource
+  that must be in the following format:
+
+  * It should use plain ASCII.
+  * It should contain one domain rule per line, terminated with \n, with nothing
+    else on the line. (The last rule in the file may omit the ending \n.)
+  * Rules should have been normalized using the same canonicalization that GURL
+    applies. For ASCII, that means they're not case-sensitive, among other
+    things; other normalizations are applied for other characters.
+  * Each rule should list the entire TLD-like domain name, with any subdomain
+    portions separated by dots (.) as usual.
+  * Rules should neither begin nor end with a dot.
+  * If a hostname matches more than one rule, the most specific rule (that is,
+    the one with more dot-levels) will be used.
+  * Other than in the case of wildcards (see below), rules do not implicitly
+    include their subcomponents. For example, "bar.baz.uk" does not imply
+    "baz.uk", and if "bar.baz.uk" is the only rule in the list, "foo.bar.baz.uk"
+    will match, but "baz.uk" and "qux.baz.uk" won't.
+  * The wildcard character '*' will match any valid sequence of characters.
+  * Wildcards may only appear as the entire most specific level of a rule. That
+    is, a wildcard must come at the beginning of a line and must be followed by
+    a dot. (You may not use a wildcard as the entire rule.)
+  * A wildcard rule implies a rule for the entire non-wildcard portion. For
+    example, the rule "*.foo.bar" implies the rule "foo.bar" (but not the rule
+    "bar"). This is typically important in the case of exceptions (see below).
+  * The exception character '!' before a rule marks an exception to a wildcard
+    rule. If your rules are "*.tokyo.jp" and "!pref.tokyo.jp", then
+    "a.b.tokyo.jp" has an effective TLD of "b.tokyo.jp", but "a.pref.tokyo.jp"
+    has an effective TLD of "tokyo.jp" (the exception prevents the wildcard
+    match, and we thus fall through to matching on the implied "tokyo.jp" rule
+    from the wildcard).
+  * If you use an exception rule without a corresponding wildcard rule, the
+    behavior is undefined.
+
+  Firefox has a very similar service, and it's their data file we use to
+  construct our resource. However, the data expected by this implementation
+  differs from the Mozilla file in several important ways:
+  (1) We require that all single-level TLDs (com, edu, etc.) be explicitly
+      listed. As of this writing, Mozilla's file includes the single-level
+      TLDs too, but that might change.
+  (2) Our data is expected be in pure ASCII: all UTF-8 or otherwise encoded
+      items must already have been normalized.
+  (3) We do not allow comments, rule notes, blank lines, or line endings other
+      than LF.
+  Rules are also expected to be syntactically valid.
+
+  The utility application tld_cleanup.exe converts a Mozilla-style file into a
+  Chrome one, making sure that single-level TLDs are explicitly listed, using
+  GURL to normalize rules, and validating the rules.
+*/
+
+#ifndef NET_BASE_REGISTRY_CONTROLLED_DOMAIN_H__
+#define NET_BASE_REGISTRY_CONTROLLED_DOMAIN_H__
+
+#include <map>
+#include <string>
+
+#include "base/basictypes.h"
+
+class GURL;
+
+// This class is a singleton.
+class RegistryControlledDomainService {
+ public:
+  // Returns the registered, organization-identifying host and all its registry
+  // information, but no subdomains, from the given GURL. Returns an empty
+  // string if the GURL is invalid, has no host (e.g. a file: URL), has multiple
+  // trailing dots, is an IP address, has only one subcomponent (i.e. no dots
+  // other than leading/trailing ones), or is itself a recognized registry
+  // identifier. If no matching rule is found in the effective-TLD data (or in
+  // the default data, if the resource failed to load), the last subcomponent of
+  // the host is assumed to be the registry.
+  //
+  // Examples:
+  //   http://www.google.com/file.html -> "google.com"  (com)
+  //   http://..google.com/file.html   -> "google.com"  (com)
+  //   http://google.com./file.html    -> "google.com." (com)
+  //   http://a.b.co.uk/file.html      -> "b.co.uk"     (co.uk)
+  //   file:///C:/bar.html             -> ""            (no host)
+  //   http://foo.com../file.html      -> ""            (multiple trailing dots)
+  //   http://192.168.0.1/file.html    -> ""            (IP address)
+  //   http://bar/file.html            -> ""            (no subcomponents)
+  //   http://co.uk/file.html          -> ""            (host is a registry)
+  //   http://foo.bar/file.html        -> "foo.bar"     (no rule; assume bar)
+  static std::string GetDomainAndRegistry(const GURL& gurl);
+
+  // Like the GURL version, but takes a host (which is canonicalized internally)
+  // instead of a full GURL.
+  static std::string GetDomainAndRegistry(const std::string& host);
+  static std::string GetDomainAndRegistry(const std::wstring& host);
+
+  // This convenience function returns true if the two GURLs both have hosts
+  // and one of the following is true:
+  // * They each have a known domain and registry, and it is the same for both
+  //   URLs. Note that this means the trailing dot, if any, must match too.
+  // * They don't have known domains/registries, but the hosts are identical.
+  // Effectively, callers can use this function to check whether the input URLs
+  // represent hosts "on the same site".
+  static bool SameDomainOrHost(const GURL& gurl1, const GURL& gurl2);
+
+  // Finds the length in bytes of the registrar portion of the host in the
+  // given GURL. Returns std::string::npos if the GURL is invalid or has no
+  // host (e.g. a file: URL). Returns 0 if the GURL has multiple trailing dots,
+  // is an IP address, has no subcomponents, or is itself a recognized registry
+  // identifier. If no matching rule is found in the effective-TLD data (or in
+  // the default data, if the resource failed to load), returns 0 if
+  // |allow_unknown_registries| is false, or the length of the last subcomponent
+  // if |allow_unknown_registries| is true.
+  //
+  // Examples:
+  //   http://www.google.com/file.html -> 3                 (com)
+  //   http://..google.com/file.html   -> 3                 (com)
+  //   http://google.com./file.html    -> 4                 (com)
+  //   http://a.b.co.uk/file.html      -> 5                 (co.uk)
+  //   file:///C:/bar.html             -> std::string::npos (no host)
+  //   http://foo.com../file.html      -> 0                 (multiple trailing
+  //                                                         dots)
+  //   http://192.168.0.1/file.html    -> 0                 (IP address)
+  //   http://bar/file.html            -> 0                 (no subcomponents)
+  //   http://co.uk/file.html          -> 0                 (host is a registry)
+  //   http://foo.bar/file.html        -> 0 or 3, depending (no rule; assume
+  //                                                         bar)
+  static size_t GetRegistryLength(const GURL& gurl,
+                                  bool allow_unknown_registries);
+
+  // Like the GURL version, but takes a host (which is canonicalized internally)
+  // instead of a full GURL.
+  static size_t GetRegistryLength(const std::string& host,
+                                  bool allow_unknown_registries);
+  static size_t GetRegistryLength(const std::wstring& host,
+                                  bool allow_unknown_registries);
+
+ protected:
+  // The entire protected API is only for unit testing. I mean it. Don't make
+  // me come over there!
+  RegistryControlledDomainService() { }
+  ~RegistryControlledDomainService() { }
+
+  // Clears the static singleton instance. This is used by unit tests to
+  // create a new instance for each test, to help ensure test independence.
+  static void ResetInstance() {
+    delete instance_;
+    instance_ = NULL;
+  }
+
+  // Sets the domain_data_ of the current instance (creating one, if necessary),
+  // then parses it.
+  static void UseDomainData(const std::string& data);
+
+ private:
+  // Using the StringSegment class, we can compare portions of strings without
+  // needing to allocate or copy them.
+  class StringSegment {
+   public:
+    StringSegment() : data_(0), begin_(0), len_(0) { }
+    ~StringSegment() { }
+
+    void Set(const char* data, size_t begin, size_t len) {
+      data_ = data;
+      begin_ = begin;
+      len_ = len;
+    }
+
+    // Returns the character at the given offset from the start of the segment,
+    // or '\0' if the offset lies outside the segment.
+    char CharAt(size_t offset) const {
+      return (offset < len_) ? data_[begin_ + offset] : '\0';
+    }
+
+    // Removes a maximum of |trimmed| number of characters, up to the length of
+    // the segment, from the start of the StringSegment.
+    void TrimFromStart(size_t trimmed) {
+      if (trimmed > len_)
+        trimmed = len_;
+      begin_ += trimmed;
+      len_ -= trimmed;
+    }
+
+    const char* data() const { return data_; }
+
+    // This comparator is needed by std::map. Note that since we don't care
+    // about the exact sorting, we use a somewhat less intuitive, but efficient,
+    // comparison.
+    bool operator<(const StringSegment& other) const;
+
+   private:
+    const char* data_;
+    size_t begin_;
+    size_t len_;
+  };
+
+  // The full domain rule data, loaded from a resource or set by a unit test.
+  std::string domain_data_;
+
+  // An entry in the map of domain specifications, describing the properties
+  // that apply to that domain rule.
+  struct DomainEntry {
+    DomainEntry() : exception(false), wildcard(false) { }
+    bool exception;
+    bool wildcard;
+  };
+  typedef std::map<StringSegment, DomainEntry> DomainMap;
+
+  // A map from a StringSegment holding a domain name (rule) to its DomainEntry.
+  // The StringSegments in the domain_map_ hold pointers to the domain_data_
+  // data; that's cheaper than copying the string data itself.
+  // TODO(pamg): Since all the domain_map_ entries have the same data_, it's
+  // redundant. Is it worth subclassing StringSegment to avoid that?
+  DomainMap domain_map_;
+
+  // Parses a list of effective-TLD rules, building the domain_map_. Rules are
+  // assumed to be syntactically valid.
+  void ParseDomainData();
+
+  // The class's singleton instance.
+  static RegistryControlledDomainService* instance_;
+
+  // Returns the singleton instance, after attempting to initialize it.
+  // NOTE that if the effective-TLD data resource can't be found, the instance
+  // will be initialized and continue operation with an empty domain_map_.
+  static RegistryControlledDomainService* GetInstance();
+
+  // Loads and parses the effective-TLD data resource.
+  void Init();
+
+  // Adds one rule, assumed to be valid, to the domain_map_.
+  // WARNING: As implied by the non-const status of the incoming rule, this
+  // method may MODIFY that rule (in particular, change its start and length).
+  // This is a performance optimization.
+  void AddRule(StringSegment* rule);
+
+  // Internal workings of the static public methods. See above.
+  static std::string GetDomainAndRegistryImpl(const std::string& host);
+  size_t GetRegistryLengthImpl(const std::string& host,
+                               bool allow_unknown_registries);
+
+  DISALLOW_EVIL_CONSTRUCTORS(RegistryControlledDomainService);
+};
+
+#endif  // NET_BASE_REGISTRY_CONTROLLED_DOMAIN_H__
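
The rule format and matching behavior documented in the header comment above can be illustrated with a small standalone sketch. This is not the implementation added by this commit (the corresponding .cc file is not part of this diff); it is a simplified reading of the documented rules. The names `RuleFlags`, `RuleMap`, `ParseRules`, and `RegistryLength` are hypothetical, the map is keyed by `std::string` copies rather than the header's `StringSegment`, and the host is assumed to be already canonicalized, free of leading or trailing dots, and not an IP address.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-in for the header's DomainEntry.
struct RuleFlags {
  bool exception = false;
  bool wildcard = false;
};
typedef std::map<std::string, RuleFlags> RuleMap;

// Parses LF-separated rules. "*.foo.bar" is stored under "foo.bar" with the
// wildcard flag set, which also provides the implied "foo.bar" rule.
// "!pref.tokyo.jp" is stored under "pref.tokyo.jp" with the exception flag.
RuleMap ParseRules(const std::string& data) {
  RuleMap rules;
  size_t start = 0;
  while (start < data.size()) {
    size_t end = data.find('\n', start);
    if (end == std::string::npos)
      end = data.size();
    const std::string rule = data.substr(start, end - start);
    start = end + 1;
    if (rule.empty())
      continue;
    if (rule[0] == '!')
      rules[rule.substr(1)].exception = true;
    else if (rule.compare(0, 2, "*.") == 0)
      rules[rule.substr(2)].wildcard = true;
    else
      rules.insert(std::make_pair(rule, RuleFlags()));
  }
  return rules;
}

// Returns the length of the registry portion of |host|, or 0 if the host has
// no dots, is itself a registry, or matches no rule while |allow_unknown| is
// false.
size_t RegistryLength(const RuleMap& rules, const std::string& host,
                      bool allow_unknown) {
  // Offsets at which each dot-separated label begins.
  std::vector<size_t> starts(1, 0);
  for (size_t i = 0; i < host.size(); ++i) {
    if (host[i] == '.')
      starts.push_back(i + 1);
  }
  if (starts.size() < 2)
    return 0;  // No subcomponents.

  // Try suffixes from longest to shortest, so the first hit is the most
  // specific rule.
  for (size_t i = 0; i < starts.size(); ++i) {
    RuleMap::const_iterator it = rules.find(host.substr(starts[i]));
    if (it == rules.end())
      continue;
    size_t registry_start = starts[i];
    if (it->second.exception) {
      if (i + 1 == starts.size())
        return 0;  // Single-label exception rule: treated as malformed here.
      registry_start = starts[i + 1];  // Drop the excepted label.
    } else if (it->second.wildcard && i > 0) {
      registry_start = starts[i - 1];  // The wildcard absorbs one more label.
    }
    if (registry_start == 0)
      return 0;  // The host is itself a registry.
    return host.size() - registry_start;
  }
  // No rule matched: optionally assume the last label is the registry.
  return allow_unknown ? host.size() - starts.back() : 0;
}
```

With the rules `com`, `uk`, `co.uk`, `*.tokyo.jp`, and `!pref.tokyo.jp`, this sketch gives a registry of `b.tokyo.jp` for `a.b.tokyo.jp` and `tokyo.jp` for `a.pref.tokyo.jp`, matching the example in the header comment. Copying each suffix into a `std::string` for lookup is exactly the allocation that the header's `StringSegment` is designed to avoid, so the sketch trades that efficiency for brevity.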