author | estade@chromium.org <estade@chromium.org@0039d316-1c4b-4281-b951-d872f2087c98> | 2009-11-06 03:05:46 +0000 |
---|---|---|
committer | estade@chromium.org <estade@chromium.org@0039d316-1c4b-4281-b951-d872f2087c98> | 2009-11-06 03:05:46 +0000 |
commit | 85c55dcd717445cd3763b5c94f9902b4cdd194b0 (patch) | |
tree | 2deea721cfac202e3eb8556f66a4cf317a331288 /chrome/renderer/spellchecker | |
parent | f1a8b962f0a6f1deb6c8c05a3f86d541e2ba61dd (diff) | |
Move the spellchecker to the renderer.
The motivation is that this removes the sync IPC on every call to the spellchecker. Also, currently we spellcheck in the IO thread, which frequently needs to go to disk (in particular, the entire spellcheck dictionary starts paged out), so this will block just the single renderer when that happens, rather than the whole IO thread.
This breaks the SpellChecker class into two new classes.
1) On the browser side, we have SpellCheckHost. This class handles browser-wide tasks, such as keeping the custom words list in sync with the on-disk custom words dictionary, downloading missing dictionaries, etc. On Posix, it also opens the bdic file since the renderer isn't allowed to open files. SpellCheckHost is created and destroyed on the UI thread. It is initialized on the file thread.
2) On the renderer side, SpellChecker2. This class will one day be renamed SpellChecker. It handles actual checking of the words, memory maps the dictionary file, loads hunspell, etc. There is one SpellChecker2 per RenderThread (hence one per render process).
My intention is for this patch to move Linux to this new approach, and follow up with ports for Windows (which will involve passing a dictionary file name rather than a file descriptor through to the renderer) and Mac (which will involve adding sync ViewHost IPC calls for when the platform spellchecker is enabled). Note that anyone using the platform spellchecker rather than Hunspell will get no benefit out of this refactor.
There should be no loss of functionality for Linux (or any other platform) in this patch. The following should all still work:
- the dictionary is loaded lazily
- hunspell is initialized lazily, per renderer
- language changes work
- dynamic downloading of new dictionaries works
- auto spell correct works (as well as toggling it)
- disabling spellcheck works
- custom words work (including adding in one renderer and immediately having it take effect in other renderers, for certain values of "immediate")
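The lazy behaviour listed above can be illustrated with a small self-contained sketch. It is not code from this patch: `LazySpellChecker`, its members, and the `request_dictionary` callback are hypothetical names, and a plain word list stands in for Hunspell. It only mirrors the flow described here, assuming the first check asks the browser for a dictionary and reports the word as correct instead of blocking, and that custom words added early are queued until the engine is loaded.

```cpp
#include <algorithm>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for the renderer-side checker. A word list plays the
// role of the Hunspell dictionary so the sketch stays self-contained.
class LazySpellChecker {
 public:
  explicit LazySpellChecker(std::function<void()> request_dictionary)
      : request_dictionary_(std::move(request_dictionary)) {}

  // Called when the browser answers the dictionary request. The engine is not
  // loaded here; that is deferred until the next word is checked.
  void Init(bool has_dictionary, const std::vector<std::string>& custom_words) {
    requested_ = true;
    has_dictionary_ = has_dictionary;
    engine_loaded_ = false;
    known_words_.clear();
    pending_words_.insert(pending_words_.end(),
                          custom_words.begin(), custom_words.end());
  }

  // May be called before or after the engine has been loaded.
  void WordAdded(const std::string& word) {
    if (engine_loaded_)
      known_words_.push_back(word);
    else
      pending_words_.push_back(word);  // Saved until the engine is loaded.
  }

  // Returns true if the word is considered correctly spelled.
  bool CheckWord(const std::string& word) {
    if (!requested_) {
      // First use: ask for a dictionary and do not block.
      request_dictionary_();
      requested_ = true;
      return true;
    }
    if (!has_dictionary_)
      return true;  // No dictionary yet, or spellchecking is disabled.
    if (!engine_loaded_)
      LoadEngine();
    return std::find(known_words_.begin(), known_words_.end(), word) !=
           known_words_.end();
  }

  bool engine_loaded() const { return engine_loaded_; }

 private:
  void LoadEngine() {
    known_words_ = pending_words_;  // Replay the queued custom words.
    engine_loaded_ = true;
  }

  std::function<void()> request_dictionary_;
  std::vector<std::string> pending_words_;
  std::vector<std::string> known_words_;
  bool requested_ = false;
  bool has_dictionary_ = false;
  bool engine_loaded_ = false;
};
```

The trade-off in this design is that words checked before the dictionary arrives are never flagged, which is the price of never blocking the renderer on the request.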
TODO:
- move spellchecker unit tests to test SpellCheck2
- add sync IPC for platform spellchecker; port to Mac
- add dictionary location fallback; port to Windows
- remove SpellChecker classes from browser/
BUG=25677
Review URL: http://codereview.chromium.org/357003
git-svn-id: svn://svn.chromium.org/chrome/trunk/src@31199 0039d316-1c4b-4281-b951-d872f2087c98
Diffstat (limited to 'chrome/renderer/spellchecker')
-rw-r--r-- | chrome/renderer/spellchecker/spellcheck.cc | 264 |
-rw-r--r-- | chrome/renderer/spellchecker/spellcheck.h | 127 |
-rw-r--r-- | chrome/renderer/spellchecker/spellcheck_worditerator.cc | 274 |
-rw-r--r-- | chrome/renderer/spellchecker/spellcheck_worditerator.h | 183 |
4 files changed, 848 insertions, 0 deletions
diff --git a/chrome/renderer/spellchecker/spellcheck.cc b/chrome/renderer/spellchecker/spellcheck.cc
new file mode 100644
index 0000000..a565b08
--- /dev/null
+++ b/chrome/renderer/spellchecker/spellcheck.cc
@@ -0,0 +1,264 @@
+// Copyright (c) 2009 The Chromium Authors. All rights reserved.
+// Use of this source code is governed by a BSD-style license that can be
+// found in the LICENSE file.
+
+#include "chrome/renderer/spellchecker/spellcheck.h"
+
+#include "base/file_util.h"
+#include "base/histogram.h"
+#include "base/time.h"
+#include "chrome/renderer/render_thread.h"
+#include "third_party/hunspell/src/hunspell/hunspell.hxx"
+
+static const int kMaxAutoCorrectWordSize = 8;
+static const int kMaxSuggestions = 5;
+
+using base::TimeTicks;
+
+SpellCheck::SpellCheck()
+    : auto_spell_correct_turned_on_(false),
+      // TODO(estade): initialize this properly.
+      is_using_platform_spelling_engine_(false),
+      initialized_(false) {
+  // Wait till we check the first word before doing any initializing.
+}
+
+SpellCheck::~SpellCheck() {
+}
+
+void SpellCheck::Init(const base::FileDescriptor& fd,
+                      const std::vector<std::string>& custom_words,
+                      const std::string language) {
+  initialized_ = true;
+  hunspell_.reset();
+  bdict_file_.reset();
+  fd_ = fd;
+  character_attributes_.SetDefaultLanguage(language);
+
+  custom_words_.insert(custom_words_.end(),
+                       custom_words.begin(), custom_words.end());
+
+  // We delay the actual initialization of hunspell until it is needed.
+}
+
+bool SpellCheck::SpellCheckWord(
+    const char16* in_word,
+    int in_word_len,
+    int tag,
+    int* misspelling_start,
+    int* misspelling_len,
+    std::vector<string16>* optional_suggestions) {
+  DCHECK(in_word_len >= 0);
+  DCHECK(misspelling_start && misspelling_len) << "Out vars must be given.";
+
+  // Do nothing if we need to delay initialization. (Rather than blocking,
+  // report the word as correctly spelled.)
+  if (InitializeIfNeeded())
+    return true;
+
+  // Do nothing if spell checking is disabled.
+  if (initialized_ && fd_.fd == -1)
+    return true;
+
+  *misspelling_start = 0;
+  *misspelling_len = 0;
+  if (in_word_len == 0)
+    return true;  // No input means always spelled correctly.
+
+  SpellcheckWordIterator word_iterator;
+  string16 word;
+  int word_start;
+  int word_length;
+  word_iterator.Initialize(&character_attributes_, in_word, in_word_len, true);
+  while (word_iterator.GetNextWord(&word, &word_start, &word_length)) {
+    // Found a word (or a contraction) that the spellchecker can check the
+    // spelling of.
+    if (CheckSpelling(word, tag))
+      continue;
+
+    // If the given word is a concatenated word of two or more valid words
+    // (e.g. "hello:hello"), we should treat it as a valid word.
+    if (IsValidContraction(word, tag))
+      continue;
+
+    *misspelling_start = word_start;
+    *misspelling_len = word_length;
+
+    // Get the list of suggested words.
+    if (optional_suggestions)
+      FillSuggestionList(word, optional_suggestions);
+    return false;
+  }
+
+  return true;
+}
+
+string16 SpellCheck::GetAutoCorrectionWord(const string16& word, int tag) {
+  string16 autocorrect_word;
+  if (!auto_spell_correct_turned_on_)
+    return autocorrect_word;  // Return the empty string.
+
+  int word_length = static_cast<int>(word.size());
+  if (word_length < 2 || word_length > kMaxAutoCorrectWordSize)
+    return autocorrect_word;
+
+  if (InitializeIfNeeded())
+    return autocorrect_word;
+
+  char16 misspelled_word[kMaxAutoCorrectWordSize + 1];
+  const char16* word_char = word.c_str();
+  for (int i = 0; i <= kMaxAutoCorrectWordSize; i++) {
+    if (i >= word_length)
+      misspelled_word[i] = NULL;
+    else
+      misspelled_word[i] = word_char[i];
+  }
+
+  // Swap adjacent characters and spellcheck.
+  int misspelling_start, misspelling_len;
+  for (int i = 0; i < word_length - 1; i++) {
+    // Swap.
+    std::swap(misspelled_word[i], misspelled_word[i + 1]);
+
+    // Check spelling.
+    misspelling_start = misspelling_len = 0;
+    SpellCheckWord(misspelled_word, word_length, tag, &misspelling_start,
+                   &misspelling_len, NULL);
+
+    // Make decision: if only one swap produced a valid word, then we want to
+    // return it. If we found two or more, we don't do autocorrection.
+    if (misspelling_len == 0) {
+      if (autocorrect_word.empty()) {
+        autocorrect_word.assign(misspelled_word);
+      } else {
+        autocorrect_word.clear();
+        break;
+      }
+    }
+
+    // Restore the swapped characters.
+    std::swap(misspelled_word[i], misspelled_word[i + 1]);
+  }
+  return autocorrect_word;
+}
+
+void SpellCheck::EnableAutoSpellCorrect(bool turn_on) {
+  auto_spell_correct_turned_on_ = turn_on;
+}
+
+void SpellCheck::WordAdded(const std::string& word) {
+  if (is_using_platform_spelling_engine_)
+    return;
+
+  if (!hunspell_.get()) {
+    // Save it for later---add it when hunspell is initialized.
+    custom_words_.push_back(word);
+  } else {
+    AddWordToHunspell(word);
+  }
+}
+
+void SpellCheck::InitializeHunspell() {
+  if (hunspell_.get())
+    return;
+
+  bdict_file_.reset(new file_util::MemoryMappedFile);
+
+  if (bdict_file_->Initialize(fd_)) {
+    TimeTicks start_time = TimeTicks::Now();
+
+    hunspell_.reset(
+        new Hunspell(bdict_file_->data(), bdict_file_->length()));
+
+    // Add custom words to Hunspell.
+    for (std::vector<std::string>::iterator it = custom_words_.begin();
+         it != custom_words_.end(); ++it) {
+      AddWordToHunspell(*it);
+    }
+
+    DHISTOGRAM_TIMES("Spellcheck.InitTime",
+                     TimeTicks::Now() - start_time);
+  }
+}
+
+void SpellCheck::AddWordToHunspell(const std::string& word) {
+  if (!word.empty() && word.length() < MAXWORDUTF8LEN)
+    hunspell_->add(word.c_str());
+}
+
+bool SpellCheck::InitializeIfNeeded() {
+  if (!initialized_) {
+    RenderThread::current()->RequestSpellCheckDictionary();
+    initialized_ = true;
+    return true;
+  }
+
+  // Check if the platform spellchecker is being used.
+  if (!is_using_platform_spelling_engine_ && fd_.fd != -1) {
+    // If it isn't, init hunspell.
+    InitializeHunspell();
+  }
+
+  return false;
+}
+
+// When called, relays the request to check the spelling to the proper
+// backend, either hunspell or a platform-specific backend.
+bool SpellCheck::CheckSpelling(const string16& word_to_check, int tag) {
+  bool word_correct = false;
+
+  if (is_using_platform_spelling_engine_) {
+    // TODO(estade): sync IPC to browser.
+    word_correct = true;
+  } else {
+    std::string word_to_check_utf8(UTF16ToUTF8(word_to_check));
+    // Hunspell shouldn't let us exceed its max, but check just in case
+    if (word_to_check_utf8.length() < MAXWORDUTF8LEN) {
+      // |hunspell_->spell| returns 0 if the word is spelled correctly and
+      // non-zero otherwsie.
+      word_correct = (hunspell_->spell(word_to_check_utf8.c_str()) != 0);
+    }
+  }
+
+  return word_correct;
+}
+
+void SpellCheck::FillSuggestionList(
+    const string16& wrong_word,
+    std::vector<string16>* optional_suggestions) {
+  if (is_using_platform_spelling_engine_) {
+    // TODO(estade): sync IPC to browser.
+    return;
+  }
+  char** suggestions;
+  int number_of_suggestions =
+      hunspell_->suggest(&suggestions, UTF16ToUTF8(wrong_word).c_str());
+
+  // Populate the vector of WideStrings.
+  for (int i = 0; i < number_of_suggestions; i++) {
+    if (i < kMaxSuggestions)
+      optional_suggestions->push_back(UTF8ToUTF16(suggestions[i]));
+    free(suggestions[i]);
+  }
+  if (suggestions != NULL)
+    free(suggestions);
+}
+
+// Returns whether or not the given string is a valid contraction.
+// This function is a fall-back when the SpellcheckWordIterator class
+// returns a concatenated word which is not in the selected dictionary
+// (e.g. "in'n'out") but each word is valid.
+bool SpellCheck::IsValidContraction(const string16& contraction, int tag) {
+  SpellcheckWordIterator word_iterator;
+  word_iterator.Initialize(&character_attributes_, contraction.c_str(),
+                           contraction.length(), false);
+
+  string16 word;
+  int word_start;
+  int word_length;
+  while (word_iterator.GetNextWord(&word, &word_start, &word_length)) {
+    if (!CheckSpelling(word, tag))
+      return false;
+  }
+  return true;
+}
diff --git a/chrome/renderer/spellchecker/spellcheck.h b/chrome/renderer/spellchecker/spellcheck.h
new file mode 100644
index 0000000..3b2e19d
--- /dev/null
+++ b/chrome/renderer/spellchecker/spellcheck.h
@@ -0,0 +1,127 @@
+// Copyright (c) 2009 The Chromium Authors. All rights reserved.
+// Use of this source code is governed by a BSD-style license that can be
+// found in the LICENSE file.
+
+#ifndef CHROME_RENDERER_SPELLCHECKER_SPELLCHECKER_H_
+#define CHROME_RENDERER_SPELLCHECKER_SPELLCHECKER_H_
+
+#include <queue>
+#include <string>
+#include <vector>
+
+#include "app/l10n_util.h"
+#include "base/file_descriptor_posix.h"
+#include "base/string16.h"
+#include "base/time.h"
+#include "chrome/renderer/spellchecker/spellcheck_worditerator.h"
+#include "unicode/uscript.h"
+
+class Hunspell;
+
+namespace base {
+class FileDescriptor;
+}
+
+namespace file_util {
+class MemoryMappedFile;
+}
+
+class SpellCheck {
+ public:
+  SpellCheck();
+
+  ~SpellCheck();
+
+  void Init(const base::FileDescriptor& bdict_fd,
+            const std::vector<std::string>& custom_words,
+            const std::string language);
+
+  // SpellCheck a word.
+  // Returns true if spelled correctly, false otherwise.
+  // If the spellchecker failed to initialize, always returns true.
+  // The |tag| parameter should either be a unique identifier for the document
+  // that the word came from (if the current platform requires it), or 0.
+  // In addition, finds the suggested words for a given word
+  // and puts them into |*optional_suggestions|.
+  // If the word is spelled correctly, the vector is empty.
+  // If optional_suggestions is NULL, suggested words will not be looked up.
+  // Note that Doing suggest lookups can be slow.
+  bool SpellCheckWord(const char16* in_word,
+                      int in_word_len,
+                      int tag,
+                      int* misspelling_start,
+                      int* misspelling_len,
+                      std::vector<string16>* optional_suggestions);
+
+  // Find a possible correctly spelled word for a misspelled word. Computes an
+  // empty string if input misspelled word is too long, there is ambiguity, or
+  // the correct spelling cannot be determined.
+  string16 GetAutoCorrectionWord(const string16& word, int tag);
+
+  // Turn auto spell correct support ON or OFF.
+  // |turn_on| = true means turn ON; false means turn OFF.
+  void EnableAutoSpellCorrect(bool turn_on);
+
+  // Add a word to the custom list. This may be called before or after
+  // |hunspell_| has been initialized.
+  void WordAdded(const std::string& word);
+
+ private:
+  // Initializes the Hunspell dictionary, or does nothing if |hunspell_| is
+  // non-null. This blocks.
+  void InitializeHunspell();
+
+  // If there is no dictionary file, then this requests one from the browser
+  // and does not block. In this case it returns true.
+  // If there is a dictionary file, but Hunspell has not been loaded, then
+  // this loads Hunspell.
+  // If Hunspell is already loaded, this does nothing. In both the latter cases
+  // it returns false, meaning that it is OK to continue spellchecking.
+  bool InitializeIfNeeded();
+
+  // When called, relays the request to check the spelling to the proper
+  // backend, either hunspell or a platform-specific backend.
+  bool CheckSpelling(const string16& word_to_check, int tag);
+
+  // When called, relays the request to fill the list with suggestions to
+  // the proper backend, either hunspell or a platform-specific backend.
+  void FillSuggestionList(const string16& wrong_word,
+                          std::vector<string16>* optional_suggestions);
+
+  // Returns whether or not the given word is a contraction of valid words
+  // (e.g. "word:word").
+  bool IsValidContraction(const string16& word, int tag);
+
+  // Add the given custom word to |hunspell_|.
+  void AddWordToHunspell(const std::string& word);
+
+  // We memory-map the BDict file.
+  scoped_ptr<file_util::MemoryMappedFile> bdict_file_;
+
+  // The hunspell dictionary in use.
+  scoped_ptr<Hunspell> hunspell_;
+
+  base::FileDescriptor fd_;
+  std::vector<std::string> custom_words_;
+
+  // Represents character attributes used for filtering out characters which
+  // are not supported by this SpellCheck object.
+  SpellcheckCharAttribute character_attributes_;
+
+  // Remember state for auto spell correct.
+  bool auto_spell_correct_turned_on_;
+
+  // True if a platform-specific spellchecking engine is being used,
+  // and False if hunspell is being used.
+  bool is_using_platform_spelling_engine_;
+
+  // This flags whether we have ever been initialized, or have asked the browser
+  // for a dictionary. The value indicates whether we should request a
+  // dictionary from the browser when the render view asks us to check the
+  // spelling of a word.
+  bool initialized_;
+
+  DISALLOW_COPY_AND_ASSIGN(SpellCheck);
+};
+
+#endif  // CHROME_RENDERER_SPELLCHECKER_SPELLCHECKER_H_
diff --git a/chrome/renderer/spellchecker/spellcheck_worditerator.cc b/chrome/renderer/spellchecker/spellcheck_worditerator.cc
new file mode 100644
index 0000000..827d9ee
--- /dev/null
+++ b/chrome/renderer/spellchecker/spellcheck_worditerator.cc
@@ -0,0 +1,274 @@
+// Copyright (c) 2009 The Chromium Authors. All rights reserved.
+// Use of this source code is governed by a BSD-style license that can be
+// found in the LICENSE file.
+
+#include "chrome/renderer/spellchecker/spellcheck_worditerator.h"
+
+#include <map>
+#include <string>
+
+#include "base/basictypes.h"
+#include "base/string_util.h"
+#include "chrome/renderer/spellchecker/spellcheck.h"
+
+#include "third_party/icu/public/common/unicode/normlzr.h"
+#include "third_party/icu/public/common/unicode/schriter.h"
+#include "third_party/icu/public/common/unicode/uchar.h"
+#include "third_party/icu/public/common/unicode/uscript.h"
+#include "third_party/icu/public/common/unicode/uset.h"
+#include "third_party/icu/public/i18n/unicode/ulocdata.h"
+
+SpellcheckCharAttribute::SpellcheckCharAttribute() {
+  InitializeScriptTable();
+
+  // Even though many dictionaries treats numbers and contractions as words and
+  // treats USCRIPT_COMMON characters as word characters, the
+  // SpellcheckWordIterator class treats USCRIPT_COMMON characters as non-word
+  // characters to strictly-distinguish contraction characters from word
+  // characters.
+  SetWordScript(USCRIPT_COMMON, false);
+
+  // Initialize the table of characters used for contractions.
+  // This array consists of the 'Midletter' and 'MidNumLet' characters of the
+  // word-break property list provided by Unicode, Inc.:
+  //   http://www.unicode.org/Public/UNIDATA/auxiliary/WordBreakProperty.txt
+  static const UChar32 kMidLetters[] = {
+    L'\x003A',  // MidLetter # COLON
+    L'\x00B7',  // MidLetter # MIDDLE DOT
+    L'\x0387',  // MidLetter # GREEK ANO TELEIA
+    L'\x05F4',  // MidLetter # HEBREW PUNCTUATION GERSHAYIM
+    L'\x2027',  // MidLetter # HYPHENATION POINT
+    L'\xFE13',  // MidLetter # PRESENTATION FORM FOR VERTICAL COLON
+    L'\xFE55',  // MidLetter # SMALL COLON
+    L'\xFF1A',  // MidLetter # FULLWIDTH COLON
+    L'\x0027',  // MidNumLet # APOSTROPHE
+    L'\x002E',  // MidNumLet # FULL STOP
+    L'\x2018',  // MidNumLet # LEFT SINGLE QUOTATION MARK
+    L'\x2019',  // MidNumLet # RIGHT SINGLE QUOTATION MARK
+    L'\x2024',  // MidNumLet # ONE DOT LEADER
+    L'\xFE52',  // MidNumLet # SMALL FULL STOP
+    L'\xFF07',  // MidNumLet # FULLWIDTH APOSTROPHE
+    L'\xFF0E',  // MidNumLet # FULLWIDTH FULL STOP
+  };
+  for (size_t i = 0; i < arraysize(kMidLetters); ++i)
+    middle_letters_[kMidLetters[i]] = true;
+}
+
+SpellcheckCharAttribute::~SpellcheckCharAttribute() {
+}
+
+// Sets the default language for this object.
+// This function retrieves the exemplar set to set up the default character
+// attributes.
+void SpellcheckCharAttribute::SetDefaultLanguage(const std::string& language) {
+  UErrorCode status = U_ZERO_ERROR;
+  ULocaleData* locale_data = ulocdata_open(language.c_str(), &status);
+  if (U_FAILURE(status))
+    return;
+
+  // Retrieves the exemplar set of the given language and update the
+  // character-attribute table to treat its characters as word characters.
+  USet* exemplar_set = uset_open(1, 0);
+  ulocdata_getExemplarSet(locale_data, exemplar_set, 0, ULOCDATA_ES_STANDARD,
+                          &status);
+  ulocdata_close(locale_data);
+  if (U_SUCCESS(status)) {
+    int length = uset_size(exemplar_set);
+    for (int i = 0; i < length; ++i) {
+      UChar32 character = uset_charAt(exemplar_set, i);
+      SetWordScript(GetScriptCode(character), true);
+    }
+
+    // Many languages use combining characters to input their characters from
+    // keyboards. On the other hand, this exemplar set does not always include
+    // combining characters for such languages.
+    // To treat such combining characters as word characters, we decompose
+    // this exemplar set and treat the decomposed characters as word characters.
+    icu::UnicodeString composed;
+    for (int i = 0; i < length; ++i)
+      composed.append(uset_charAt(exemplar_set, i));
+
+    icu::UnicodeString decomposed;
+    icu::Normalizer::decompose(composed, FALSE, 0, decomposed, status);
+    if (U_SUCCESS(status)) {
+      icu::StringCharacterIterator iterator(decomposed);
+      UChar32 character = iterator.first32();
+      while (character != icu::CharacterIterator::DONE) {
+        SetWordScript(GetScriptCode(character), true);
+        character = iterator.next32();
+      }
+    }
+  }
+  uset_close(exemplar_set);
+}
+
+// Returns whether or not the given character is a character used by the
+// selected dictionary.
+bool SpellcheckCharAttribute::IsWordChar(UChar32 character) const {
+  return IsWordScript(GetScriptCode(character)) && !u_isdigit(character);
+}
+
+// Returns whether or not the given character is a character used by
+// contractions.
+bool SpellcheckCharAttribute::IsContractionChar(UChar32 character) const {
+  std::map<UChar32, bool>::const_iterator iterator;
+  iterator = middle_letters_.find(character);
+  if (iterator == middle_letters_.end())
+    return false;
+  return iterator->second;
+}
+
+// Initializes the mapping table.
+void SpellcheckCharAttribute::InitializeScriptTable() {
+  for (size_t i = 0; i < arraysize(script_attributes_); ++i)
+    script_attributes_[i] = false;
+}
+
+// Retrieves the ICU script code.
+UScriptCode SpellcheckCharAttribute::GetScriptCode(UChar32 character) const {
+  UErrorCode status = U_ZERO_ERROR;
+  UScriptCode script_code = uscript_getScript(character, &status);
+  return U_SUCCESS(status) ? script_code : USCRIPT_INVALID_CODE;
+}
+
+// Updates the mapping table from an ICU script code to its attribute, i.e.
+// whether not a script is used by the selected dictionary.
+void SpellcheckCharAttribute::SetWordScript(const int script_code,
+                                            bool in_use) {
+  if (script_code < 0 ||
+      static_cast<size_t>(script_code) >= arraysize(script_attributes_))
+    return;
+  script_attributes_[script_code] = in_use;
+}
+
+// Returns whether or not the given script is used by the selected
+// dictionary.
+bool SpellcheckCharAttribute::IsWordScript(
+    const UScriptCode script_code) const {
+  if (script_code < 0 ||
+      static_cast<size_t>(script_code) >= arraysize(script_attributes_))
+    return false;
+  return script_attributes_[script_code];
+}
+
+SpellcheckWordIterator::SpellcheckWordIterator()
+    : word_(NULL),
+      length_(0),
+      position_(0),
+      allow_contraction_(false),
+      attribute_(NULL) {
+}
+
+SpellcheckWordIterator::~SpellcheckWordIterator() {
+}
+
+// Initialize a word-iterator object.
+void SpellcheckWordIterator::Initialize(
+    const SpellcheckCharAttribute* attribute,
+    const char16* word,
+    size_t length,
+    bool allow_contraction) {
+  word_ = word;
+  position_ = 0;
+  length_ = static_cast<int>(length);
+  allow_contraction_ = allow_contraction;
+  attribute_ = attribute;
+}
+
+// Retrieves a word (or a contraction).
+// When a contraction is enclosed with contraction characters (e.g. 'isn't',
+// 'rock'n'roll'), we should discard the beginning and the end of the
+// contraction but we should never split the contraction.
+// To handle this case easily, we should firstly extract a segment consisting
+// of word characters and contraction characters, and discard contraction
+// characters at the beginning and the end of the extracted segment.
+bool SpellcheckWordIterator::GetNextWord(string16* word_string,
+                                         int* word_start,
+                                         int* word_length) {
+  word_string->empty();
+  *word_start = 0;
+  *word_length = 0;
+  while (position_ < length_) {
+    int segment_start = 0;
+    int segment_end = 0;
+    GetSegment(&segment_start, &segment_end);
+    TrimSegment(segment_start, segment_end, word_start, word_length);
+    if (*word_length > 0)
+      return Normalize(*word_start, *word_length, word_string);
+  }
+
+  return false;
+}
+
+// Retrieves a segment consisting of word characters (and contraction
+// characters if the |allow_contraction_| value is true).
+// When the current position refers to a non-word character, this function
+// returns a non-empty segment consisting of the character itself. In this
+// case, the TrimSegment() function discards the character and returns an
+// empty word (i.e. |word_length| == 0).
+void SpellcheckWordIterator::GetSegment(int* segment_start,
+                                        int* segment_end) {
+  int position = position_;
+  while (position < length_) {
+    UChar32 character;
+    U16_NEXT(word_, position, length_, character);
+    if (!attribute_->IsWordChar(character)) {
+      if (!allow_contraction_ || !attribute_->IsContractionChar(character))
+        break;
+    }
+  }
+  *segment_start = position_;
+  *segment_end = position;
+  position_ = position;
+}
+
+// Discards non-word characters at the beginning and the end of the given
+// segment.
+void SpellcheckWordIterator::TrimSegment(int segment_start,
+                                         int segment_end,
+                                         int* word_start,
+                                         int* word_length) const {
+  while (segment_start < segment_end) {
+    UChar32 character;
+    int segment_next = segment_start;
+    U16_NEXT(word_, segment_next, segment_end, character);
+    if (attribute_->IsWordChar(character)) {
+      *word_start = segment_start;
+      break;
+    }
+    segment_start = segment_next;
+  }
+  while (segment_end >= segment_start) {
+    UChar32 character;
+    int segment_prev = segment_end;
+    U16_PREV(word_, segment_start, segment_prev, character);
+    if (attribute_->IsWordChar(character)) {
+      *word_length = segment_end - segment_start;
+      break;
+    }
+    segment_end = segment_prev;
+  }
+}
+
+// Normalizes a non-terminated string into its canonical form so that
+// a spellchecker object can check spellings of words which contain ligatures,
+// full-width letters, etc.
+// USCRIPT_LATIN does not only consists of US-ASCII and ISO/IEC 8859-1, but
+// also consists of ISO/IEC 8859-{2,3,4,9,10}, ligatures, fullwidth latin,
+// etc. For its details, please read the script table in
+// "http://www.unicode.org/Public/UNIDATA/Scripts.txt".
+bool SpellcheckWordIterator::Normalize(int input_start,
+                                       int input_length,
+                                       string16* output_string) const {
+  // Unicode Standard Annex #15 "http://www.unicode.org/unicode/reports/tr15/"
+  // does not only write NFKD and NFKC can compose ligatures into their ASCII
+  // alternatives, but also write NFKC keeps accents of characters.
+  // Therefore, NFKC seems to be the best option for hunspell.
+  icu::UnicodeString input(FALSE, &word_[input_start], input_length);
+  UErrorCode status = U_ZERO_ERROR;
+  icu::UnicodeString output;
+  icu::Normalizer::normalize(input, UNORM_NFKC, 0, output, status);
+  if (U_SUCCESS(status))
+    output_string->assign(output.getTerminatedBuffer());
+  return status == U_ZERO_ERROR || status == U_STRING_NOT_TERMINATED_WARNING;
+}
diff --git a/chrome/renderer/spellchecker/spellcheck_worditerator.h b/chrome/renderer/spellchecker/spellcheck_worditerator.h
new file mode 100644
index 0000000..7763314
--- /dev/null
+++ b/chrome/renderer/spellchecker/spellcheck_worditerator.h
@@ -0,0 +1,183 @@
+// Copyright (c) 2009 The Chromium Authors. All rights reserved.
+// Use of this source code is governed by a BSD-style license that can be
+// found in the LICENSE file.
+
+#ifndef CHROME_RENDERER_SPELLCHECKER_SPELLCHECK_WORDITERATOR_H_
+#define CHROME_RENDERER_SPELLCHECKER_SPELLCHECK_WORDITERATOR_H_
+
+#include <map>
+#include <string>
+
+#include "base/basictypes.h"
+#include "base/string16.h"
+
+#include "unicode/uscript.h"
+
+// A class which handles character attributes dependent on a spellchecker and
+// its dictionary.
+// This class is used by the SpellcheckWordIterator class to determine whether
+// or not a character is one used by the spellchecker and its dictinary.
+class SpellcheckCharAttribute {
+ public:
+  SpellcheckCharAttribute();
+
+  ~SpellcheckCharAttribute();
+
+  // Sets the default language of the spell checker. This controls which
+  // characters are considered parts of words of the given language.
+  void SetDefaultLanguage(const std::string& language);
+
+  // Returns whether or not the given character is a character used by the
+  // selected dictionary.
+  // Parameters
+  //   * character [in] (UChar32)
+  //       Represents a Unicode character to be checked.
+  // Return values
+  //   * true
+  //       The given character is a word character.
+  //   * false
+  //       The given character is not a word character.
+  bool IsWordChar(UChar32 character) const;
+
+  // Returns whether or not the given character is a character used by
+  // contractions.
+  // Parameters
+  //   * character [in] (UChar32)
+  //       Represents a Unicode character to be checked.
+  // Return values
+  //   * true
+  //       The given character is a character used by contractions.
+  //   * false
+  //       The given character is not a character used by contractions.
+  bool IsContractionChar(UChar32 character) const;
+
+ private:
+  // Initializes the mapping table.
+  void InitializeScriptTable();
+
+  // Retrieves the ICU script code.
+  UScriptCode GetScriptCode(UChar32 character) const;
+
+  // Updates an entry in the mapping table.
+  void SetWordScript(const int script_code, bool in_use);
+
+  // Returns whether or not the given script is used by the selected
+  // dictionary.
+  bool IsWordScript(const UScriptCode script_code) const;
+
+ private:
+  // Represents a mapping table from a script code to a boolean value
+  // representing whether or not the script is used by the selected dictionary.
+  bool script_attributes_[USCRIPT_CODE_LIMIT];
+
+  // Represents a table of characters used by contractions.
+  std::map<UChar32, bool> middle_letters_;
+
+  DISALLOW_COPY_AND_ASSIGN(SpellcheckCharAttribute);
+};
+
+// A class which implements methods for finding the location of word boundaries
+// used by the Spellchecker class.
+// This class is implemented on the following assumptions:
+//   * An input string is encoded in UTF-16 (i.e. it may contain surrogate
+//     pairs), and;
+//   * The length of a string is the number of UTF-16 characters in the string
+//     (i.e. the length of a non-BMP character becomes two).
+class SpellcheckWordIterator {
+ public:
+  SpellcheckWordIterator();
+
+  ~SpellcheckWordIterator();
+
+  // Initializes a word-iterator object.
+  // Parameters
+  //   * attribute [in] (const SpellcheckCharAttribute*)
+  //       Represents a set of character attributes used for filtering out
+  //       non-word characters.
+  //   * word [in] (const char16*)
+  //       Represents a string from which this object extracts words.
+  //       (This string does not have to be NUL-terminated.)
+  //   * length [in] (size_t)
+  //       Represents the length of the given string, in UTF-16 characters.
+  //       This value should not include terminating NUL characters.
+  //   * allow_contraction [in] (bool)
+  //       Represents a flag to control whether or not this object should split a
+  //       possible contraction (e.g. "isn't", "in'n'out", etc.)
+  // Return values
+  //   * true
+  //       This word-iterator object is initialized successfully.
+  //   * false
+  //       An error occured while initializing this object.
+  void Initialize(const SpellcheckCharAttribute* attribute,
+                  const char16* word,
+                  size_t length,
+                  bool allow_contraction);
+
+  // Retrieves a word (or a contraction).
+  // Parameters
+  //   * word_string [out] (string16*)
+  //       Represents a word (or a contraction) to be checked its spelling.
+  //       This |word_string| has been already normalized to its canonical form
+  //       (i.e. decomposed ligatures, replaced full-width latin characters to
+  //       its ASCII alternatives, etc.) so that a SpellChecker object can check
+  //       its spelling without any additional operations.
+  //       On the other hand, a substring of the input string
+  //         string16 str(&word[word_start], word_length);
+  //       represents the non-normalized version of this extracted word.
+  //   * word_start [out] (int*)
+  //       Represents the offset of this word from the beginning of the input
+  //       string, in UTF-16 characters.
+  //   * word_length [out] (int*)
+  //       Represents the length of an extracted word before normalization, in
+  //       UTF-16 characters.
+  //       When the input string contains ligatures, this value may not be equal
+  //       to the length of the |word_string|.
+  // Return values
+  //   * true
+  //       Found a word (or a contraction) to be checked its spelling.
+  //   * false
+  //       Not found any more words or contractions to be checked their spellings.
+  bool GetNextWord(string16* word_string,
+                   int* word_start,
+                   int* word_length);
+
+ private:
+  // Retrieves a segment consisting of word characters (and contraction
+  // characters if the |allow_contraction| value is true).
+  void GetSegment(int* segment_start,
+                  int* segment_end);
+
+  // Discards non-word characters at the beginning and the end of the given
+  // segment.
+  void TrimSegment(int segment_start,
+                   int segment_end,
+                   int* word_start,
+                   int* word_length) const;
+
+  // Normalizes the given segment of the |word_| variable and write its
+  // canonical form to the |output_string|.
+  bool Normalize(int input_start,
+                 int input_length,
+                 string16* output_string) const;
+
+ private:
+  // The pointer to the input string from which we are extracting words.
+  const char16* word_;
+
+  // The length of the original string.
+  int length_;
+
+  // The current position in the original string.
+  int position_;
+
+  // The flag to control whether or not this object should extract possible
+  // contractions.
+  bool allow_contraction_;
+
+  // The character attributes used for filtering out non-word characters.
+  const SpellcheckCharAttribute* attribute_;
+
+  DISALLOW_COPY_AND_ASSIGN(SpellcheckWordIterator);
+};
+
+#endif  // CHROME_RENDERER_SPELLCHECKER_SPELLCHECK_WORDITERATOR_H_
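The adjacent-swap heuristic used by `GetAutoCorrectionWord` in the diff can be tried in isolation with the sketch below. It is an illustration, not the patch's code: `AutoCorrectBySwap` and the `is_valid` callback are hypothetical names, `std::string` replaces `string16`, and the callback stands in for the Hunspell lookup. It keeps the same rules as the patch: words shorter than 2 or longer than 8 characters are skipped, and a correction is returned only when exactly one swap produces a valid word.

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <utility>

// Hypothetical helper: returns the unique word obtained by swapping one pair
// of adjacent characters in |word|, or an empty string if no swap or more
// than one swap yields a word accepted by |is_valid|.
std::string AutoCorrectBySwap(
    const std::string& word,
    const std::function<bool(const std::string&)>& is_valid) {
  const std::size_t kMaxAutoCorrectWordSize = 8;  // Same limit as the patch.
  if (word.size() < 2 || word.size() > kMaxAutoCorrectWordSize)
    return std::string();

  std::string candidate = word;
  std::string correction;
  for (std::size_t i = 0; i + 1 < candidate.size(); ++i) {
    // Swap one adjacent pair and check the result.
    std::swap(candidate[i], candidate[i + 1]);
    if (is_valid(candidate)) {
      if (!correction.empty())
        return std::string();  // A second hit makes the fix ambiguous.
      correction = candidate;
    }
    // Restore the original order before trying the next pair.
    std::swap(candidate[i], candidate[i + 1]);
  }
  return correction;
}
```

With a validator that accepts only "the", the input "teh" yields "the". If the validator accepts both "abc" and "bca", the input "bac" yields an empty string because two different swaps succeed.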