Improve how xgettext handles Perl syntax ambiguities.

author: Guido Flohr <guido@imperia.net> 2010-04-01 12:25:06 +0200
committer: Bruno Haible <bruno@clisp.org> 2010-04-01 12:25:28 +0200
commit: e53359672a6c99adeee2c0234dae0d0fbc47feed (patch)
tree: cc648b186dd15cb59f0a3a03acd3c64c4eab5e40 /gettext-tools
parent: 26aba54072c6155f632e33cba7e523e56326d512 (diff)
download: external_gettext-e53359672a6c99adeee2c0234dae0d0fbc47feed.zip
external_gettext-e53359672a6c99adeee2c0234dae0d0fbc47feed.tar.gz
external_gettext-e53359672a6c99adeee2c0234dae0d0fbc47feed.tar.bz2
7 files changed, 432 insertions, 66 deletions
diff --git a/gettext-tools/doc/ChangeLog b/gettext-tools/doc/ChangeLog
index 6105b93..4afbf4f 100644
--- a/gettext-tools/doc/ChangeLog
+++ b/gettext-tools/doc/ChangeLog
@@ -1,3 +1,9 @@
+2010-03-31  Guido Flohr  <guido@imperia.net>
+
+	More explanations about how xgettext handles Perl syntax ambiguities.
+	* gettext.texi (General Problems): Explain how xgettext disambiguates
+	conditional operator vs. regular expression.
+
 2010-03-13  Bruno Haible  <bruno@clisp.org>
 
 	New options --color, --style for many programs.
diff --git a/gettext-tools/doc/gettext.texi b/gettext-tools/doc/gettext.texi
index 04b9ee3..d342601 100644
--- a/gettext-tools/doc/gettext.texi
+++ b/gettext-tools/doc/gettext.texi
@@ -10997,6 +10997,48 @@ the semantic context.  If a slash is really a division sign but
 mis-interpreted as a pattern match, the rest of the input file is most
 probably parsed incorrectly.
 
+There are certain cases, where the ambiguity cannot be resolved at all:
+
+@example
+$x = wantarray ? 1 : 0;
+@end example
+
+The Perl built-in function @code{wantarray} does not accept any arguments.
+The Perl parser therefore knows that the question mark does not start
+a regular expression but is the ternary conditional operator.
+
+@example 
+sub wantarrays @{@}
+$x = wantarrays ? 1 : 0;
+@end example
+
+Now the situation is different.  The function @code{wantarrays} takes
+a variable number of arguments (like any non-prototyped Perl function).
+The question mark is now the delimiter of a pattern match, and hence
+the piece of code does not compile.
+
+@example
+sub wantarrays() @{@}
+$x = wantarrays ? 1 : 0;
+@end example
+
+Now the function is prototyped, Perl knows that it does not accept any
+arguments, and the question mark is therefore interpreted as the
+ternaray operator again.  But that unfortunately outsmarts @code{xgettext}.
+
+The Perl parser in @code{xgettext} cannot know whether a function has
+a prototype and what that prototype would look like.  It therefore makes
+an educated guess.  If a function is known to be a Perl built-in and
+this function does not accept any arguments, a following question mark
+or slash is treated as an operator, otherwise as the delimiter of a
+following regular expression.  The Perl built-ins that do not accept
+arguments are @code{wantarray}, @code{fork}, @code{time}, @code{times},
+@code{getlogin}, @code{getppid}, @code{getpwent}, @code{getgrent},
+@code{gethostent}, @code{getnetent}, @code{getprotoent}, @code{getservent},
+@code{setpwent}, @code{setgrent}, @code{endpwent}, @code{endgrent},
+@code{endhostent}, @code{endnetent}, @code{endprotoent}, and
+@code{endservent}.
+
 If you find that @code{xgettext} fails to extract strings from
 portions of your sources, you should therefore look out for slashes
 and/or question marks preceding these sections.  You may have come
@@ -11004,6 +11046,17 @@ across a bug in @code{xgettext}'s Perl parser (and of course you
 should report that bug).  In the meantime you should consider to
 reformulate your code in a manner less challenging to @code{xgettext}.
 
+In particular, if the parser is too dumb to see that a function
+does not accept arguments, use parentheses:
+
+@example
+$x = somefunc() ? 1 : 0;
+$y = (somefunc) ? 1 : 0;
+@end example
+
+In fact the Perl parser itself has similar problems and warns you
+about such constructs.
+
 @node Default Keywords, Special Keywords, General Problems, Perl
 @subsubsection Which keywords will xgettext look for?
 @cindex Perl default keywords
diff --git a/gettext-tools/src/ChangeLog b/gettext-tools/src/ChangeLog
index 47183dd..cb00dc6 100644
--- a/gettext-tools/src/ChangeLog
+++ b/gettext-tools/src/ChangeLog
@@ -1,5 +1,28 @@
 2010-03-31  Guido Flohr  <guido@imperia.net>
 
+	Improve how xgettext handles Perl syntax ambiguities.
+	* x-perl.c(enum token_type_ty): New enumeration items
+	token_type_number, token_type_object.
+	(struct token_ty): New field 'last_type'.
+	(token2string): Handle token_type_number, token_type_object.
+	(free_token): Likewise.
+	(prefer_division_over_regexp): Remove variable.
+	(extract_variable): Recognize token of type token_type_object.
+	(prefer_regexp_over_division): New function.
+	(last_token_type): Renamed from last_token.
+	(x_perl_prelex): Assign the token's last_type. Recognize token of type
+	token_type_number. Don't special-case "grep" and "split". Invoke
+	prefer_regexp_over_division for disambiguation.
+	(token_stack_dump): Handle token_type_number, token_type_object.
+	(x_perl_lex): Assign the token's last_type. Update last_token_type
+	intelligently.
+	(collect_message): Invoke prefer_regexp_over_division for
+	disambiguation.
+	(extract_balanced): Don't set last_token_type here. Handle
+	token_type_number, token_type_object.
+	(extract_perl): Initialize last_token_type here.
+	Reported by Guillem Jover <guillem@debian.org> via Santiago Vila.
+
 	* x-perl.c (x_perl_prelex): Clarify interpolate_keywords arguments.
 
 	* x-perl.c (eaten_here): Renamed from here_eaten.
diff --git a/gettext-tools/src/x-perl.c b/gettext-tools/src/x-perl.c
index c2e9fee..c7843c4 100644
--- a/gettext-tools/src/x-perl.c
+++ b/gettext-tools/src/x-perl.c
@@ -1,7 +1,7 @@
 /* xgettext Perl backend.
    Copyright (C) 2002-2010 Free Software Foundation, Inc.
 
-   This file was written by Guido Flohr <guido@imperia.net>, 2002-2003.
+   This file was written by Guido Flohr <guido@imperia.net>, 2002-2010.
 
    This program is free software: you can redistribute it and/or modify
    it under the terms of the GNU General Public License as published by
@@ -234,7 +234,7 @@ phase1_getc ()
       ++line_number;
 
       /* Undosify.  This is important for catching the end of <<EOF and
-         <<'EOF'.  We could rely on stdio doing this for us but you
+         <<'EOF'.  We could rely on stdio doing this for us but
          it is not uncommon to to come across Perl scripts with CRLF
          newline conventions on systems that do not follow this
          convention.  */
@@ -506,15 +506,18 @@ enum token_type_ty
   token_type_rparen,            /* ) */
   token_type_comma,             /* , */
   token_type_fat_comma,         /* => */
-  token_type_dereference,       /* , */
+  token_type_dereference,       /* -> */
   token_type_semicolon,         /* ; */
   token_type_lbrace,            /* { */
   token_type_rbrace,            /* } */
   token_type_lbracket,          /* [ */
   token_type_rbracket,          /* ] */
   token_type_string,            /* quote-like */
+  token_type_number,            /* starting with a digit o dot */
   token_type_named_op,          /* if, unless, while, ... */
   token_type_variable,          /* $... */
+  token_type_object,            /* A dereferenced variable, maybe a blessed
+                                   object.  */
   token_type_symbol,            /* symbol, number */
   token_type_regex_op,          /* s, tr, y, m.  */
   token_type_dot,               /* . */
@@ -548,12 +551,14 @@ typedef struct token_ty token_ty;
 struct token_ty
 {
   token_type_ty type;
+  token_type_ty last_type;
   int sub_type;                 /* for token_type_string, token_type_symbol */
   char *string;                 /* for:                 in encoding:
                                    token_type_named_op  ASCII
                                    token_type_string    UTF-8
                                    token_type_symbol    ASCII
                                    token_type_variable  global_source_encoding
+                                   token_type_object    global_source_encoding
                                  */
   refcounted_string_list_ty *comment; /* for token_type_string */
   int line_number;
@@ -589,10 +594,14 @@ token2string (const token_ty *token)
       return "token_type_rbracket";
     case token_type_string:
       return "token_type_string";
+    case token_type_number:
+      return "token type number";
     case token_type_named_op:
       return "token_type_named_op";
     case token_type_variable:
       return "token_type_variable";
+    case token_type_object:
+      return "token_type_object";
     case token_type_symbol:
       return "token_type_symbol";
     case token_type_regex_op:
@@ -617,6 +626,7 @@ free_token (token_ty *tp)
     case token_type_string:
     case token_type_symbol:
     case token_type_variable:
+    case token_type_object:
       free (tp->string);
       break;
     default:
@@ -748,14 +758,6 @@ extract_quotelike_pass1_utf8 (int delim)
 /* ========= Reading of tokens and commands.  Extracting strings.  ========= */
 
 
-/* There is an ambiguity about '/': It can start a division operator ('/' or
-   '/=') or it can start a regular expression.  The distinction is important
-   because inside regular expressions, '#' loses its special meaning.
-   The distinction is possible depending on the parsing state: After a
-   variable or simple expression, it's a division operator; at the beginning
-   of an expression, it's a regexp.  */
-static bool prefer_division_over_regexp;
-
 /* Context lookup table.  */
 static flag_context_list_table_ty *flag_context_list_table;
 
@@ -1436,8 +1438,6 @@ extract_variable (message_list_ty *mlp, token_ty *tp, int first)
            real_file_name, line_number, tp->string);
 #endif
 
-  prefer_division_over_regexp = true;
-
   /*
    * 3) If the following looks strange to you, this is valid Perl syntax:
    *
@@ -1483,6 +1483,7 @@ extract_variable (message_list_ty *mlp, token_ty *tp, int first)
 
       if (maybe_hash_value && is_dereference)
         {
+          tp->type = token_type_object;
 #if DEBUG_PERL
           fprintf (stderr, "%s:%d: first keys preceded by \"->\"\n",
                    real_file_name, line_number);
@@ -2002,9 +2003,94 @@ interpolate_keywords (message_list_ty *mlp, const char *string, int lineno)
     }
 }
 
-/* The last token seen in the token stream.  This is important for the
-   interpretation of '?' and '/'.  */
-static token_type_ty last_token;
+/* There is an ambiguity about '/' and '?': They can start an operator
+   (division operator '/' or '/=' or the conditional operator '?'), or they can
+   start a regular expression.  The distinction is important because inside
+   regular expressions, '#' loses its special meaning.  This function helps
+   making the decision (a heuristic).  See the documentation for details.  */
+static bool
+prefer_regexp_over_division (token_type_ty type)
+{
+  bool retval = true;
+
+  switch (type)
+    {
+      case token_type_eof:
+        retval = true;
+        break;
+      case token_type_lparen:
+        retval = true;
+        break;
+      case token_type_rparen:
+        retval = false;
+        break;
+      case token_type_comma:
+        retval = true;
+        break;
+      case token_type_fat_comma:
+        retval = true;
+        break;
+      case token_type_dereference:
+        retval = true;
+        break;
+      case token_type_semicolon:
+        retval = true;
+        break;
+      case token_type_lbrace:
+        retval = true;
+        break;
+      case token_type_rbrace:
+        retval = false;
+        break;
+      case token_type_lbracket:
+        retval = true;
+        break;
+      case token_type_rbracket:
+        retval = false;
+        break;
+      case token_type_string:
+        retval = false;
+        break;
+      case token_type_number:
+        retval = false;
+        break;
+      case token_type_named_op:
+        retval = true;
+        break;
+      case token_type_variable:
+        retval = false;
+        break;
+      case token_type_object:
+        retval = false;
+        break;
+      case token_type_symbol:
+      case token_type_keyword_symbol:
+        retval = true;
+        break;
+      case token_type_regex_op:
+        retval = false;
+        break;
+      case token_type_dot:
+        retval = true;
+        break;
+      case token_type_other:
+        retval = true;
+        break;
+  }
+
+#if DEBUG_PERL
+  token_ty ty;
+  ty.type = type;
+  fprintf (stderr, "Prefer regexp over division after %s: %s\n",
+           token2string (&ty), retval ? "true" : "false");
+#endif
+
+  return retval;
+}
+
+/* Last token type seen in the stream.  Important for the interpretation
+   of slash and question mark.  */
+static token_type_ty last_token_type;
 
 /* Combine characters into tokens.  Discard whitespace.  */
 
@@ -2020,6 +2106,7 @@ x_perl_prelex (message_list_ty *mlp, token_ty *tp)
     {
       c = phase2_getc ();
       tp->line_number = line_number;
+      tp->last_type = last_token_type;
 
       switch (c)
         {
@@ -2043,7 +2130,6 @@ x_perl_prelex (message_list_ty *mlp, token_ty *tp)
           if (!extract_all)
             {
               extract_variable (mlp, tp, c);
-              prefer_division_over_regexp = true;
               return;
             }
           break;
@@ -2060,17 +2146,11 @@ x_perl_prelex (message_list_ty *mlp, token_ty *tp)
             if (c2 == '.')
               {
                 tp->type = token_type_other;
-                prefer_division_over_regexp = false;
                 return;
               }
-            else if (c2 >= '0' && c2 <= '9')
-              {
-                prefer_division_over_regexp = false;
-              }
-            else
+            else if (!(c2 >= '0' && c2 <= '9'))
               {
                 tp->type = token_type_dot;
-                prefer_division_over_regexp = true;
                 return;
               }
           }
@@ -2089,7 +2169,6 @@ x_perl_prelex (message_list_ty *mlp, token_ty *tp)
         case '0': case '1': case '2': case '3': case '4':
         case '5': case '6': case '7': case '8': case '9':
           /* Symbol, or part of a number.  */
-          prefer_division_over_regexp = true;
           bufpos = 0;
           for (;;)
             {
@@ -2154,7 +2233,6 @@ x_perl_prelex (message_list_ty *mlp, token_ty *tp)
             {
               tp->type = token_type_named_op;
               tp->string = xstrdup (buffer);
-              prefer_division_over_regexp = false;
               return;
             }
           else if (strcmp (buffer, "s") == 0
@@ -2180,7 +2258,6 @@ x_perl_prelex (message_list_ty *mlp, token_ty *tp)
                   tp->type = token_type_symbol;
                   tp->sub_type = symbol_type_none;
                   tp->string = xstrdup (buffer);
-                  prefer_division_over_regexp = true;
                   return;
                 }
               extract_triple_quotelike (mlp, tp, delim,
@@ -2214,7 +2291,6 @@ x_perl_prelex (message_list_ty *mlp, token_ty *tp)
                   tp->type = token_type_symbol;
                   tp->sub_type = symbol_type_none;
                   tp->string = xstrdup (buffer);
-                  prefer_division_over_regexp = true;
                   return;
                 }
               extract_quotelike (tp, delim);
@@ -2223,7 +2299,6 @@ x_perl_prelex (message_list_ty *mlp, token_ty *tp)
               free (tp->string);
               drop_reference (tp->comment);
               tp->type = token_type_regex_op;
-              prefer_division_over_regexp = true;
 
               /* Eat the following modifiers.  */
               do
@@ -2253,7 +2328,6 @@ x_perl_prelex (message_list_ty *mlp, token_ty *tp)
                   tp->type = token_type_eof;
                   return;
                 }
-              prefer_division_over_regexp = true;
 
               if ((delim >= '0' && delim <= '9')
                   || (delim >= 'A' && delim <= 'Z')
@@ -2264,7 +2338,6 @@ x_perl_prelex (message_list_ty *mlp, token_ty *tp)
                   tp->type = token_type_symbol;
                   tp->sub_type = symbol_type_none;
                   tp->string = xstrdup (buffer);
-                  prefer_division_over_regexp = true;
                   return;
                 }
 
@@ -2296,10 +2369,10 @@ x_perl_prelex (message_list_ty *mlp, token_ty *tp)
                 }
               return;
             }
-          else if (strcmp (buffer, "grep") == 0
-                   || strcmp (buffer, "split") == 0)
+          else if ((buffer[0] >= '0' && buffer[0] <= '9') || buffer[0] == '.')
             {
-              prefer_division_over_regexp = false;
+              tp->type = token_type_number;
+              return;
             }
           tp->type = token_type_symbol;
           tp->sub_type = (strcmp (buffer, "sub") == 0
@@ -2309,21 +2382,18 @@ x_perl_prelex (message_list_ty *mlp, token_ty *tp)
           return;
 
         case '"':
-          prefer_division_over_regexp = true;
           extract_quotelike (tp, c);
           tp->sub_type = string_type_qq;
           interpolate_keywords (mlp, tp->string, line_number);
           return;
 
         case '`':
-          prefer_division_over_regexp = true;
           extract_quotelike (tp, c);
           tp->sub_type = string_type_qq;
           interpolate_keywords (mlp, tp->string, line_number);
           return;
 
         case '\'':
-          prefer_division_over_regexp = true;
           extract_quotelike (tp, c);
           tp->sub_type = string_type_q;
           return;
@@ -2336,42 +2406,34 @@ x_perl_prelex (message_list_ty *mlp, token_ty *tp)
           else
             phase2_ungetc (c);
           tp->type = token_type_lparen;
-          prefer_division_over_regexp = false;
           return;
 
         case ')':
           tp->type = token_type_rparen;
-          prefer_division_over_regexp = true;
           return;
 
         case '{':
           tp->type = token_type_lbrace;
-          prefer_division_over_regexp = false;
           return;
 
         case '}':
           tp->type = token_type_rbrace;
-          prefer_division_over_regexp = false;
           return;
 
         case '[':
           tp->type = token_type_lbracket;
-          prefer_division_over_regexp = false;
           return;
 
         case ']':
           tp->type = token_type_rbracket;
-          prefer_division_over_regexp = false;
           return;
 
         case ';':
           tp->type = token_type_semicolon;
-          prefer_division_over_regexp = false;
           return;
 
         case ',':
           tp->type = token_type_comma;
-          prefer_division_over_regexp = false;
           return;
 
         case '=':
@@ -2383,8 +2445,8 @@ x_perl_prelex (message_list_ty *mlp, token_ty *tp)
               return;
             }
           else if (linepos == 2
-                   && (last_token == token_type_semicolon
-                       || last_token == token_type_rbrace)
+                   && (last_token_type == token_type_semicolon
+                       || last_token_type == token_type_rbrace)
                    && ((c >= 'A' && c <='Z')
                        || (c >= 'a' && c <= 'z')))
             {
@@ -2401,12 +2463,10 @@ x_perl_prelex (message_list_ty *mlp, token_ty *tp)
             }
           phase1_ungetc (c);
           tp->type = token_type_other;
-          prefer_division_over_regexp = false;
           return;
 
         case '<':
           /* Check for <<EOF and friends.  */
-          prefer_division_over_regexp = false;
           c = phase1_getc ();
           if (c == '<')
             {
@@ -2509,19 +2569,17 @@ x_perl_prelex (message_list_ty *mlp, token_ty *tp)
             }
           phase1_ungetc (c);
           tp->type = token_type_other;
-          prefer_division_over_regexp = false;
           return;
 
         case '/':
         case '?':
-          if (!prefer_division_over_regexp)
+          if (prefer_regexp_over_division (tp->last_type))
             {
               extract_quotelike (tp, c);
               interpolate_keywords (mlp, tp->string, line_number);
               free (tp->string);
               drop_reference (tp->comment);
-              tp->type = token_type_other;
-              prefer_division_over_regexp = true;
+              tp->type = token_type_regex_op;
               /* Eat the following modifiers.  */
               do
                 c = phase1_getc ();
@@ -2544,7 +2602,6 @@ x_perl_prelex (message_list_ty *mlp, token_ty *tp)
              as we only need to recognize gettext invocations.  Don't
              bother.  */
           tp->type = token_type_other;
-          prefer_division_over_regexp = false;
           return;
         }
     }
@@ -2583,6 +2640,10 @@ token_stack_dump (token_stack_ty *stack)
         case token_type_variable:
           fprintf (stderr, "    string: %s\n", token->string);
           break;
+        case token_type_object:
+          fprintf (stderr, "    string: %s->\n", token->string);
+        default:
+          break;
         }
     }
   fprintf (stderr, "END STACK DUMP\n");
@@ -2651,10 +2712,69 @@ x_perl_lex (message_list_ty *mlp)
     {
       tp = XMALLOC (token_ty);
       x_perl_prelex (mlp, tp);
+      tp->last_type = last_token_type;
+      last_token_type = tp->type;
+
 #if DEBUG_PERL
       fprintf (stderr, "%s:%d: x_perl_prelex returned %s\n",
                real_file_name, line_number, token2string (tp));
 #endif
+
+      /* The interpretation of a slash or question mark after a function call
+         depends on the prototype of that function.  If the function expects
+         at least one argument, a regular expression is preferred, otherwise
+         an operator.  With our limited means, we can only guess here.  If
+         the function is a builtin that takes no arguments, we prefer an
+         operator by silently turning the last symbol into a variable instead
+         of a symbol.
+
+         Method calls without parentheses are not ambiguous.  After them, an
+         operator must follow.  Due to some ideosyncrasies in this parser
+         they are treated in two different manners.  If the call is
+         chained ($foo->bar->baz) the token left of the symbol is a
+         dereference operator.  If it is not chained ($foo->bar) the
+         dereference operator is consumed with the extracted variable.  The
+         latter case is handled below.  */
+      if (tp->type == token_type_symbol)
+        {
+          if (tp->last_type == token_type_dereference)
+            {
+              /* Class method call or chained method call (with at least
+                 two arrow operators).  */
+              last_token_type = token_type_variable;
+            }
+          else if (tp->last_type == token_type_object)
+            {
+              /* Instance method, not chained.  */
+              last_token_type = token_type_variable;
+            }
+          else if (strcmp (tp->string, "wantarray") == 0
+                   || strcmp (tp->string, "fork") == 0
+                   || strcmp (tp->string, "getlogin") == 0
+                   || strcmp (tp->string, "getppid") == 0
+                   || strcmp (tp->string, "getpwent") == 0
+                   || strcmp (tp->string, "getgrent") == 0
+                   || strcmp (tp->string, "gethostent") == 0
+                   || strcmp (tp->string, "getnetent") == 0
+                   || strcmp (tp->string, "getprotoent") == 0
+                   || strcmp (tp->string, "getservent") == 0
+                   || strcmp (tp->string, "setpwent") == 0
+                   || strcmp (tp->string, "setgrent") == 0
+                   || strcmp (tp->string, "endpwent") == 0
+                   || strcmp (tp->string, "endgrent") == 0
+                   || strcmp (tp->string, "endhostent") == 0
+                   || strcmp (tp->string, "endnetent") == 0
+                   || strcmp (tp->string, "endprotoent") == 0
+                   || strcmp (tp->string, "endservent") == 0
+                   || strcmp (tp->string, "time") == 0
+                   || strcmp (tp->string, "times") == 0
+                   || strcmp (tp->string, "wait") == 0
+                   || strcmp (tp->string, "wantarray") == 0)
+            {
+              /* A Perl built-in function that does not accept arguments.  */
+              last_token_type = token_type_variable;
+            }
+        }
     }
 #if DEBUG_PERL
   else
@@ -2789,7 +2909,8 @@ collect_message (message_list_ty *mlp, token_ty *tp, int error_level)
       phase2_ungetc (c);
 
       if (c == '"' || c == '\'' || c == '`'
-          || (!prefer_division_over_regexp && (c == '/' || c == '?'))
+          || ((c == '/' || c == '?')
+              && prefer_regexp_over_division (tp->last_type))
           || c == 'q')
         {
           token_ty *qstring = x_perl_lex (mlp);
@@ -2916,9 +3037,6 @@ extract_balanced (message_list_ty *mlp,
   ++nesting_level;
 #endif
 
-  last_token = token_type_semicolon;  /* Safe assumption.  */
-  prefer_division_over_regexp = false;
-
   for (;;)
     {
       /* The current token.  */
@@ -2926,8 +3044,6 @@ extract_balanced (message_list_ty *mlp,
 
       tp = x_perl_lex (mlp);
 
-      last_token = tp->type;
-
       if (delim == tp->type)
         {
           xgettext_current_source_encoding = po_charset_utf8;
@@ -3016,6 +3132,7 @@ extract_balanced (message_list_ty *mlp,
       switch (tp->type)
         {
         case token_type_symbol:
+        case token_type_keyword_symbol:
 #if DEBUG_PERL
           fprintf (stderr, "%s:%d: type symbol (%d) \"%s\"\n",
                    logical_file_name, tp->line_number, nesting_level,
@@ -3031,7 +3148,6 @@ extract_balanced (message_list_ty *mlp,
                 const struct callshapes *shapes =
                   (const struct callshapes *) keyword_value;
 
-                last_token = token_type_keyword_symbol;
                 next_shapes = shapes;
                 next_argparser = arglist_parser_alloc (mlp, shapes);
               }
@@ -3052,9 +3168,22 @@ extract_balanced (message_list_ty *mlp,
         case token_type_variable:
 #if DEBUG_PERL
           fprintf (stderr, "%s:%d: type variable (%d) \"%s\"\n",
-                   logical_file_name, tp->line_number, nesting_level, tp->string);
+                   logical_file_name, tp->line_number, nesting_level,
+                   tp->string);
+#endif
+          next_is_argument = false;
+          if (next_argparser != NULL)
+            free (next_argparser);
+          next_argparser = NULL;
+          next_context_iter = null_context_list_iterator;
+          break;
+
+        case token_type_object:
+#if DEBUG_PERL
+          fprintf (stderr, "%s:%d: type object (%d) \"%s->\"\n",
+                   logical_file_name, tp->line_number, nesting_level,
+                   tp->string);
 #endif
-          prefer_division_over_regexp = true;
           next_is_argument = false;
           if (next_argparser != NULL)
             free (next_argparser);
@@ -3217,6 +3346,18 @@ extract_balanced (message_list_ty *mlp,
           next_context_iter = null_context_list_iterator;
           break;
 
+        case token_type_number:
+#if DEBUG_PERL
+          fprintf (stderr, "%s:%d: type number (%d)\n",
+                   logical_file_name, tp->line_number, nesting_level);
+#endif
+          next_is_argument = false;
+          if (next_argparser != NULL)
+            free (next_argparser);
+          next_argparser = NULL;
+          next_context_iter = null_context_list_iterator;
+          break;
+
         case token_type_eof:
 #if DEBUG_PERL
           fprintf (stderr, "%s:%d: type EOF (%d)\n",
@@ -3429,6 +3570,9 @@ extract_perl (FILE *f, const char *real_filename, const char *logical_filename,
   eaten_here = 0;
   end_of_file = false;
 
+  /* Safe assumption.  */
+  last_token_type = token_type_semicolon;
+
   /* Eat tokens until eof is seen.  When extract_balanced returns
      due to an unbalanced closing brace, just restart it.  */
   while (!extract_balanced (mlp, token_type_rbrace, true, false,
@@ -3441,7 +3585,7 @@ extract_perl (FILE *f, const char *real_filename, const char *logical_filename,
   free (logical_file_name);
   logical_file_name = NULL;
   line_number = 0;
-  last_token = token_type_semicolon;
+  last_token_type = token_type_semicolon;
   token_stack_free (&token_stack);
   eaten_here = 0;
   end_of_file = true;
diff --git a/gettext-tools/tests/ChangeLog b/gettext-tools/tests/ChangeLog
index 4ca6a39..deb0123 100644
--- a/gettext-tools/tests/ChangeLog
+++ b/gettext-tools/tests/ChangeLog
@@ -1,3 +1,10 @@
+2010-03-31  Guido Flohr  <guido@imperia.net>
+
+	Improve how xgettext handles Perl syntax ambiguities.
+	* xgettext-perl-8: New file.
+	* Makefile.am (TESTS): Add it.
+	Reported by Guillem Jover <guillem@debian.org> via Santiago Vila.
+
 2009-12-26  Bruno Haible  <bruno@clisp.org>
 
 	Enable the gettext-6 and gettext-7 tests also on MacOS X.
diff --git a/gettext-tools/tests/Makefile.am b/gettext-tools/tests/Makefile.am
index 74dc13f..5a89a6e 100644
--- a/gettext-tools/tests/Makefile.am
+++ b/gettext-tools/tests/Makefile.am
@@ -1,5 +1,5 @@
 ## Makefile for the gettext-tools/tests subdirectory of GNU gettext
-## Copyright (C) 1995-1997, 2001-2009 Free Software Foundation, Inc.
+## Copyright (C) 1995-1997, 2001-2010 Free Software Foundation, Inc.
 ##
 ## This program is free software: you can redistribute it and/or modify
 ## it under the terms of the GNU General Public License as published by
@@ -86,7 +86,7 @@ TESTS = gettext-1 gettext-2 gettext-3 gettext-4 gettext-5 gettext-6 gettext-7 \
 	xgettext-lisp-1 xgettext-lisp-2 \
 	xgettext-objc-1 xgettext-objc-2 \
 	xgettext-perl-1 xgettext-perl-2 xgettext-perl-3 xgettext-perl-4 \
-	xgettext-perl-5 xgettext-perl-6 xgettext-perl-7 \
+	xgettext-perl-5 xgettext-perl-6 xgettext-perl-7 xgettext-perl-8 \
 	xgettext-php-1 xgettext-php-2 xgettext-php-3 xgettext-php-4 \
 	xgettext-po-1 \
 	xgettext-properties-1 \
diff --git a/gettext-tools/tests/xgettext-perl-8 b/gettext-tools/tests/xgettext-perl-8
new file mode 100755
index 0000000..74247cd
--- /dev/null
+++ b/gettext-tools/tests/xgettext-perl-8
@@ -0,0 +1,133 @@
+#! /bin/sh
+
+# The slash (/) and the question mark (?) serve a double-purpose in Perl.
+# Depending on the context they can either be an operator (division
+# or ternary respectively) or they are regex delimiters for pattern
+# matches.  This test case checks the proper recognition.
+
+tmpfiles=""
+trap 'rm -fr $tmpfiles' 1 2 3 15
+
+tmpfiles="$tmpfiles xg-pl-8.pl"
+cat <<\EOF > xg-pl-8.pl
+info(__("using %s."), ($a->b() eq "auto" ? "" : ""));
+
+print __"Question mark after string is an operator!\n";
+# ?; Re-sync.
+
+@times = sort {$a - $b} split /,\s*/, $options
+    if (defined $options && $options);
+
+print __"First slash in a an argument to a function starts a pattern match.";
+# /; Re-sync.
+
+$0 =~ /xyz/ ? 'foo' : 'bar';
+
+print __"Question mark after a regular pattern match is an operator!";
+# ?; Re-sync.
+
+$0 =~ m{xyz} ? 'foo' : 'bar';
+
+print __"Question mark after a nesting pattern match is an operator!";
+# ?; Re-sync.
+
+$0 =~ m|xyz| ? 'foo' : 'bar';
+
+print __"Question mark after a non-nesting pattern match is an operator!";
+# ?; Re-sync.
+
+print __(<<EOS);
+Line number for here documents is not correct.
+EOS
+
+$foo = wantarray ? 1 : 0;
+
+print __"The function wantarray does not take arguments!";
+# ?; Re-sync.
+
+$foo = Something->method ? 1 : 0;
+
+print __"Class method calls without parentheses do not accept arguments!";
+# ?; Re-sync.
+
+$foo = $Something->method ? 1 : 0;
+
+print __"Instance method calls without parentheses do not accept arguments!";
+# ?; Re-sync.
+
+$foo = $Some->thing->method ? 1 : 0;
+
+print __"Chained method calls without parentheses do not accept arguments!";
+# ?; Re-sync.
+
+print __"Synching works.";
+EOF
+
+tmpfiles="$tmpfiles xg-pl-8.tmp.po xg-pl-8.po"
+: ${XGETTEXT=xgettext}
+${XGETTEXT} --omit-header -n \
+  -k__ \
+  -d xg-pl-8.tmp xg-pl-8.pl
+test $? = 0 || { rm -fr $tmpfiles; exit 1; }
+LC_ALL=C tr -d '\r' < xg-pl-8.tmp.po > xg-pl-8.po
+test $? = 0 || { rm -fr $tmpfiles; exit 1; }
+
+tmpfiles="$tmpfiles xg-pl-8.ok"
+cat <<\EOF > xg-pl-8.ok
+#: xg-pl-8.pl:1
+#, perl-format
+msgid "using %s."
+msgstr ""
+
+#: xg-pl-8.pl:3
+msgid "Question mark after string is an operator!\n"
+msgstr ""
+
+#: xg-pl-8.pl:9
+msgid "First slash in a an argument to a function starts a pattern match."
+msgstr ""
+
+#: xg-pl-8.pl:14
+msgid "Question mark after a regular pattern match is an operator!"
+msgstr ""
+
+#: xg-pl-8.pl:19
+msgid "Question mark after a nesting pattern match is an operator!"
+msgstr ""
+
+#: xg-pl-8.pl:24
+msgid "Question mark after a non-nesting pattern match is an operator!"
+msgstr ""
+
+#: xg-pl-8.pl:28
+msgid "Line number for here documents is not correct.\n"
+msgstr ""
+
+#: xg-pl-8.pl:33
+msgid "The function wantarray does not take arguments!"
+msgstr ""
+
+#: xg-pl-8.pl:38
+msgid "Class method calls without parentheses do not accept arguments!"
+msgstr ""
+
+#: xg-pl-8.pl:43
+msgid "Instance method calls without parentheses do not accept arguments!"
+msgstr ""
+
+#: xg-pl-8.pl:48
+msgid "Chained method calls without parentheses do not accept arguments!"
+msgstr ""
+
+#: xg-pl-8.pl:51
+msgid "Synching works."
+msgstr ""
+EOF
+
+: ${DIFF=diff}
+${DIFF} xg-pl-8.ok xg-pl-8.po
+result=$?
+
+rm -fr $tmpfiles
+
+exit $result
author	Guido Flohr <guido@imperia.net>	2010-04-01 12:25:06 +0200
committer	Bruno Haible <bruno@clisp.org>	2010-04-01 12:25:28 +0200
commit	e53359672a6c99adeee2c0234dae0d0fbc47feed (patch)
tree	cc648b186dd15cb59f0a3a03acd3c64c4eab5e40 /gettext-tools
parent	26aba54072c6155f632e33cba7e523e56326d512 (diff)
download	external_gettext-e53359672a6c99adeee2c0234dae0d0fbc47feed.zip external_gettext-e53359672a6c99adeee2c0234dae0d0fbc47feed.tar.gz external_gettext-e53359672a6c99adeee2c0234dae0d0fbc47feed.tar.bz2