The first step towards supporting the Hungarian spell-checking dictionary.

This change fixes a couple of problems that must be resolved before a Hungarian dictionary can be used in Chrome.

1. Use TrimWhitespace() in TrimLine().
   This fixes my own mistake: I used TrimWhiteSpaceUTF8() without checking its behavior closely.

2. Replace morphing rules with compound rules.
   Existing Hungarian dictionaries seem to use (language-specific) morphing rules to handle words that have both prefixes and suffixes, e.g. "legjobb" (best). It is better to replace such (language-dependent) morphing rules with (language-independent) compound rules to avoid language-specific issues. (As far as I tested, this change fixes many quality problems caused by Hungarian compounds.)

This change also adds simple tests for our dictionary converter.

BUG=15558
TEST=unit_test --gtest_filter=ConvertDictTest*

Review URL: http://codereview.chromium.org/553087

git-svn-id: svn://svn.chromium.org/chrome/trunk/src@37816 0039d316-1c4b-4281-b951-d872f2087c98
author: hbono@chromium.org <hbono@chromium.org@0039d316-1c4b-4281-b951-d872f2087c98> 2010-02-02 10:02:26 +0000
committer: hbono@chromium.org <hbono@chromium.org@0039d316-1c4b-4281-b951-d872f2087c98> 2010-02-02 10:02:26 +0000
commit: bbffa669691c4d2f9f1ab8f95226171be7b2dd04 (patch)
tree: 0336c36be3524514fe5ab0f2d7341a6ae270877d /chrome/tools/convert_dict/dic_reader.cc
parent: 4f4c43ca4eed4bff261f6e4ff760a02455ef50aa (diff)
1 files changed, 7 insertions, 0 deletions
diff --git a/chrome/tools/convert_dict/dic_reader.cc b/chrome/tools/convert_dict/dic_reader.cc
index 70c30a9..2233d04 100644
--- a/chrome/tools/convert_dict/dic_reader.cc
+++ b/chrome/tools/convert_dict/dic_reader.cc
@@ -106,6 +106,13 @@ bool PopulateWordSet(WordSet* word_set, FILE* file, AffReader* aff_reader,
         affix_index = aff_reader->GetAFIndexForAFString(split[1]);
     }
 
+    // Discard the morphological description if it is attached to the first
+    // token. (It is attached to the first token if a word doesn't have affix
+    // rules.)
+    size_t word_tab_offset = utf8word.find('\t');
+    if (word_tab_offset != std::string::npos)
+      utf8word = utf8word.substr(0, word_tab_offset);
+
     WordSet::iterator found = word_set->find(utf8word);
     if (found == word_set->end()) {
       std::set<int> affix_vector;
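
The tab handling added in the hunk above can be tried in isolation. The following is a minimal sketch, not code from dic_reader.cc: the helper name StripMorphologicalDescription and the sample entry "alma\tpo:noun" are hypothetical and used only for illustration. It assumes, as the comment in the diff states, that a morphological description is separated from the word by a tab.

```cpp
#include <string>

// Hypothetical helper (not part of dic_reader.cc): returns |utf8word| with
// any tab-separated morphological description removed. This mirrors the
// find('\t') / substr() logic added in the hunk above.
std::string StripMorphologicalDescription(const std::string& utf8word) {
  const std::string::size_type tab_offset = utf8word.find('\t');
  if (tab_offset == std::string::npos)
    return utf8word;  // No description attached; keep the word unchanged.
  return utf8word.substr(0, tab_offset);
}
```

With this sketch, an input of "alma\tpo:noun" yields "alma", and a word without a tab is returned unchanged, so the later word_set->find() lookup would see only the bare word.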