Let's look at the source of the nextToken() method.
1. Control characters in Java
    case ' ': // (Spec 3.6)
    case '\t': // (Spec 3.6)
    case FF: // (Spec 3.6) form feed
        do {
            scanChar(); // advances bufpointer
        } while (ch == ' ' || ch == '\t' || ch == FF);
        endPos = bufpointer;
        processWhiteSpace();
        break;
    case LF: // (Spec 3.4)
        scanChar();
        endPos = bufpointer;
        processLineTerminator();
        break;
    case CR: // (Spec 3.4) \r
        scanChar();
        if (ch == LF) { // \n
            scanChar();
        }
        endPos = bufpointer;
        processLineTerminator();
        break;
For LF and CR, see https://docs.oracle.com/javase/specs/jls/se7/html/jls-3.html#jls-3.4
For FF, \t and other white space, see https://docs.oracle.com/javase/specs/jls/se7/html/jls-3.html#jls-3.6
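The CR case above treats CR followed by LF as a single line terminator. As a minimal sketch of that rule outside javac, the hypothetical helper below (LineTerminatorSketch and countLineTerminators are names invented for this illustration) counts line terminators in a string the same way:

```java
// Hypothetical helper, not part of javac: counts line terminators the way
// the LF and CR cases consume them (CR LF counts as one terminator).
final class LineTerminatorSketch {
    static final char LF = '\n';
    static final char CR = '\r';

    static int countLineTerminators(String src) {
        int count = 0;
        int i = 0;
        while (i < src.length()) {
            char ch = src.charAt(i++);
            if (ch == LF) {
                count++;
            } else if (ch == CR) {
                // Swallow an immediately following LF, as the CR case does.
                if (i < src.length() && src.charAt(i) == LF) {
                    i++;
                }
                count++;
            }
        }
        return count;
    }
}
```

For example, countLineTerminators("a\r\nb\nc\rd") sees three terminators: CR LF, LF, and a lone CR.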
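2. Java identifiers
The rules are:
(1) An identifier consists of letters, digits, underscores (_), and dollar signs ($).
(2) It cannot start with a digit.
(3) It cannot be a Java keyword.
(4) Chinese characters are accepted without error, but are best avoided.
The implementation for identifiers that start with an ASCII letter, '$', or '_':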
    case 'A': case 'B': case 'C': case 'D': case 'E':
    case 'F': case 'G': case 'H': case 'I': case 'J':
    case 'K': case 'L': case 'M': case 'N': case 'O':
    case 'P': case 'Q': case 'R': case 'S': case 'T':
    case 'U': case 'V': case 'W': case 'X': case 'Y':
    case 'Z':
    case 'a': case 'b': case 'c': case 'd': case 'e':
    case 'f': case 'g': case 'h': case 'i': case 'j':
    case 'k': case 'l': case 'm': case 'n': case 'o':
    case 'p': case 'q': case 'r': case 's': case 't':
    case 'u': case 'v': case 'w': case 'x': case 'y':
    case 'z':
    case '$': case '_':
        scanIdent();
        return;
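Rules (1), (2) and (4) can be tried out with the standard library alone. The sketch below is an illustration, not javac's code: IdentifierSketch and looksLikeIdentifier are hypothetical names, the keyword check from rule (3) is deliberately left out, and characters outside the Basic Multilingual Plane are not handled.

```java
// Hypothetical helper, not part of javac. It checks the character rules
// only; it does NOT reject keywords such as "class".
final class IdentifierSketch {
    static boolean looksLikeIdentifier(String s) {
        // Rule (2): the first character must be a valid identifier start,
        // which excludes digits.
        if (s.isEmpty() || !Character.isJavaIdentifierStart(s.charAt(0))) {
            return false;
        }
        // Rule (1): every later character must be a valid identifier part.
        for (int i = 1; i < s.length(); i++) {
            if (!Character.isJavaIdentifierPart(s.charAt(i))) {
                return false;
            }
        }
        return true;
    }
}
```

With this helper, "$value_1" and "变量" are accepted, while "1abc" and "a-b" are rejected.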
3. Numeric literals
    case '0':
        scanChar();
        if (ch == 'x' || ch == 'X') { // e.g. int x = 0x101
            scanChar();
            skipIllegalUnderscores();
            if (ch == '.') {
                scanHexFractionAndSuffix(false);
            } else if (digit(16) < 0) {
                lexError("invalid.hex.number");
            } else {
                scanNumber(16);
            }
        } else if (ch == 'b' || ch == 'B') { // e.g. int x = 0b101
            // binary literals are a new feature in Java 7
            if (!allowBinaryLiterals) {
                // binary literals are not supported in -source {0}
                // (use -source 7 or higher to enable binary literals)
                lexError("unsupported.binary.lit", source.name);
                allowBinaryLiterals = true;
            }
            scanChar();
            skipIllegalUnderscores();
            if (digit(2) < 0) {
                // a binary number must contain at least one binary digit
                lexError("invalid.binary.number");
            } else {
                scanNumber(2);
            }
        } else {
            putChar('0');
            if (ch == '_') {
                int savePos = bufpointer;
                do {
                    scanChar();
                } while (ch == '_');
                if (digit(10) < 0) {
                    // illegal underscore
                    lexError(savePos, "illegal.underscore");
                }
            }
            scanNumber(8);
        }
        return;
    case '1': case '2': case '3': case '4':
    case '5': case '6': case '7': case '8': case '9':
        scanNumber(10);
        return;
    case '.':
        scanChar();
        if ('0' <= ch && ch <= '9') {
            putChar('.');
            scanFractionAndSuffix();
        } else if (ch == '.') {
            putChar('.'); putChar('.');
            scanChar();
            if (ch == '.') {
                scanChar();
                putChar('.');
                token = ELLIPSIS;
            } else {
                lexError("malformed.fp.lit");
            }
        } else {
            token = DOT;
        }
        return;
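The prefix test in the '0' case decides which radix is passed to scanNumber. The sketch below mirrors only that decision for integer literals. RadixSketch, radixOf and valueOf are hypothetical names, the input is assumed to be a non-empty, well-formed integer literal without a suffix, and underscore placement is not validated.

```java
// Hypothetical helper, not part of javac: picks the radix from the literal's
// prefix, following the branches of the '0' and '1'..'9' cases.
final class RadixSketch {
    static int radixOf(String literal) {
        if (literal.charAt(0) != '0') {
            return 10;                      // '1'..'9' -> scanNumber(10)
        }
        if (literal.length() > 1) {
            char second = literal.charAt(1);
            if (second == 'x' || second == 'X') return 16;
            if (second == 'b' || second == 'B') return 2;
        }
        return 8;                           // any other leading 0 -> scanNumber(8)
    }

    static long valueOf(String literal) {
        int radix = radixOf(literal);
        // Underscores only separate digits; drop them before parsing.
        String digits = literal.replace("_", "");
        if (radix == 16 || radix == 2) {
            digits = digits.substring(2);   // strip the 0x / 0b prefix
        }
        return Long.parseLong(digits, radix);
    }
}
```

Note that a plain "0" also takes the octal branch, which is harmless because zero has the same value in every radix.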
4. Slash
    case '/':
        scanChar();
        if (ch == '/') {
            do {
                scanCommentChar();
            } while (ch != CR && ch != LF && bufpointer < buflen);
            if (bufpointer < buflen) {
                endPos = bufpointer;
                processComment(CommentStyle.LINE);
            }
            break;
        } else if (ch == '*') { // block comment or documentation comment
            scanChar();
            CommentStyle style;
            if (ch == '*') {
                style = CommentStyle.JAVADOC;
                scanDocComment();
            } else {
                style = CommentStyle.BLOCK;
                while (bufpointer < buflen) {
                    if (ch == '*') {
                        scanChar();
                        if (ch == '/') break;
                    } else {
                        scanCommentChar();
                    }
                }
            }
            if (ch == '/') {
                scanChar();
                endPos = bufpointer;
                processComment(style);
                break;
            } else {
                lexError("unclosed.comment");
                return;
            }
        } else if (ch == '=') {
            name = names.slashequals;
            token = SLASHEQ;
            scanChar();
        } else {
            name = names.slash;
            token = SLASH;
        }
        return;
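What a '/' turns into depends entirely on the one or two characters that follow it. The sketch below reproduces just that lookahead. SlashSketch, Kind and classify are hypothetical names, and the helper does not scan the comment body or check that a comment is closed.

```java
// Hypothetical helper, not part of javac: classifies text starting with '/'
// by looking at the next characters, in the same order as the '/' case.
final class SlashSketch {
    enum Kind { LINE_COMMENT, BLOCK_COMMENT, JAVADOC_COMMENT, SLASHEQ, SLASH }

    static Kind classify(String src) {
        if (src.isEmpty() || src.charAt(0) != '/') {
            throw new IllegalArgumentException("input must start with '/'");
        }
        char next = src.length() > 1 ? src.charAt(1) : '\0';
        if (next == '/') {
            return Kind.LINE_COMMENT;
        }
        if (next == '*') {
            // A second '*' right after "/*" selects the documentation style.
            boolean secondStar = src.length() > 2 && src.charAt(2) == '*';
            return secondStar ? Kind.JAVADOC_COMMENT : Kind.BLOCK_COMMENT;
        }
        if (next == '=') {
            return Kind.SLASHEQ;
        }
        return Kind.SLASH;
    }
}
```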
5. Single quote
    case '\'':
        scanChar();
        if (ch == '\'') {
            lexError("empty.char.lit");
        } else {
            if (ch == CR || ch == LF)
                lexError(pos, "illegal.line.end.in.char.lit");
            scanLitChar();
            if (ch == '\'') {
                scanChar();
                token = CHARLITERAL;
            } else {
                lexError(pos, "unclosed.char.lit");
            }
        }
        return;
6. Double quote
    case '\"':
        scanChar();
        while (ch != '\"' && ch != CR && ch != LF && bufpointer < buflen)
            scanLitChar();
        if (ch == '\"') {
            token = STRINGLITERAL;
            scanChar();
        } else {
            lexError(pos, "unclosed.str.lit");
        }
        return;
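The loop above stops at the closing quote, at a line terminator, or at the end of the buffer, and only the first of these yields a STRINGLITERAL. A rough sketch of the same stopping rule follows. StringLiteralSketch and endOfStringLiteral are hypothetical names, and escape handling is simplified to "a backslash protects the next character", which is an assumption about scanLitChar rather than its actual code.

```java
// Hypothetical helper, not part of javac: finds the end of a string literal
// that starts at src.charAt(start), which must be the opening quote.
final class StringLiteralSketch {
    /** Returns the index just past the closing quote, or -1 if unclosed. */
    static int endOfStringLiteral(String src, int start) {
        int i = start + 1;
        while (i < src.length()) {
            char ch = src.charAt(i);
            if (ch == '"') {
                return i + 1;               // closed: STRINGLITERAL
            }
            if (ch == '\r' || ch == '\n') {
                return -1;                  // line end inside the literal
            }
            boolean escapesNext = ch == '\\'
                    && i + 1 < src.length()
                    && src.charAt(i + 1) != '\r'
                    && src.charAt(i + 1) != '\n';
            i += escapesNext ? 2 : 1;       // skip the escaped character too
        }
        return -1;                          // ran off the end of the input
    }
}
```

A return value of -1 corresponds to the "unclosed.str.lit" branch.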
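7. Default handling
In javac, the nextToken method decides which characters combine into a token. Each call produces one token, and that token is always one of the constants of the enum com.sun.tools.javac.parser.Token, which is defined as follows: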
    /** An interface that defines codes for Java source tokens
     *  returned from lexical analysis.
     */
    public enum Token implements Formattable {
        EOF, ERROR,
        IDENTIFIER, // e.g. class, package, variable, and method names
        ABSTRACT("abstract"), ASSERT("assert"), BOOLEAN("boolean"), BREAK("break"),
        BYTE("byte"), CASE("case"), CATCH("catch"), CHAR("char"), CLASS("class"),
        CONST("const"), CONTINUE("continue"), DEFAULT("default"), DO("do"),
        DOUBLE("double"), ELSE("else"), ENUM("enum"), EXTENDS("extends"),
        FINAL("final"), FINALLY("finally"), FLOAT("float"), FOR("for"), GOTO("goto"),
        IF("if"), IMPLEMENTS("implements"), IMPORT("import"),
        INSTANCEOF("instanceof"), INT("int"), INTERFACE("interface"), LONG("long"),
        NATIVE("native"), NEW("new"), PACKAGE("package"), PRIVATE("private"),
        PROTECTED("protected"), PUBLIC("public"), RETURN("return"), SHORT("short"),
        STATIC("static"), STRICTFP("strictfp"), SUPER("super"), SWITCH("switch"),
        SYNCHRONIZED("synchronized"), THIS("this"), THROW("throw"), THROWS("throws"),
        TRANSIENT("transient"), TRY("try"), VOID("void"), VOLATILE("volatile"),
        WHILE("while"),
        INTLITERAL, LONGLITERAL, FLOATLITERAL, DOUBLELITERAL, CHARLITERAL, STRINGLITERAL,
        TRUE("true"), FALSE("false"), NULL("null"),
        LPAREN("("), RPAREN(")"), LBRACE("{"), RBRACE("}"), LBRACKET("["), RBRACKET("]"),
        SEMI(";"), COMMA(","), DOT("."), ELLIPSIS("..."),
        EQ("="), GT(">"), LT("<"), BANG("!"), TILDE("~"), QUES("?"), COLON(":"),
        EQEQ("=="), LTEQ("<="), GTEQ(">="), BANGEQ("!="), AMPAMP("&&"), BARBAR("||"),
        PLUSPLUS("++"), SUBSUB("--"),
        PLUS("+"), SUB("-"), STAR("*"), SLASH("/"), AMP("&"), BAR("|"), CARET("^"),
        PERCENT("%"), LTLT("<<"), GTGT(">>"), GTGTGT(">>>"),
        PLUSEQ("+="), SUBEQ("-="), STAREQ("*="), SLASHEQ("/="), AMPEQ("&="),
        BAREQ("|="), CARETEQ("^="), PERCENTEQ("%="),
        LTLTEQ("<<="), GTGTEQ(">>="), GTGTGTEQ(">>>="),
        MONKEYS_AT("@"),
        CUSTOM;
        ...
    }
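The characters collected by a nextToken call are turned into a Name object, and all Name objects are stored in the inner class Name.Table, which is covered in a separate article.
Keywords first converts the Token.name of every Token constant that has a name into a Name object, then records the correspondence between Name and Token in the key array of the Keywords class.
The Keywords class defines the following important field: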
    /** The names of all tokens. */
    private Name[] tokenName = new Name[values().length];
tokenName is filled during initialization by the following code:
    private void enterKeyword(String s, Token token) {
        Name n = names.fromString(s);
        tokenName[token.ordinal()] = n;
        if (n.getIndex() > maxKey) {
            maxKey = n.getIndex();
        }
    }
The array then holds:
...
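Only enum constants that have a name get a tokenName entry; their ordinals run from 3 (ABSTRACT) to 109 (MONKEYS_AT). Within that range, the six literal tokens INTLITERAL through STRINGLITERAL (ordinals 53 to 58) have no name and therefore no entry.
tokenName can then be used to build the mapping from Name to Token. The fields involved are: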
    /**
     * Keyword array. Maps name indices to Token.
     */
    private final Token[] key;

    /** The number of the last entered keyword. */
    private int maxKey = 0;
The code that fills key is:
    protected Keywords(Context context) {
        // ...
        key = new Token[maxKey+1];
        for (int i = 0; i <= maxKey; i++) {
            key[i] = IDENTIFIER;
        }
        for (Token t : values()) {
            if (t.name != null) {
                int oi = t.ordinal();
                int ti = tokenName[oi].getIndex();
                key[ti] = t;
            }
        }
    }
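maxKey is 2905. The index into key is the index of a Name, and the element stored there is the Token. Every slot not listed below holds IDENTIFIER; the remaining entries are: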
    2630=abstract 2638=assert 1195=boolean 2644=break 1054=byte 2649=case 2653=catch 1092=char
    63=class 2658=const 2663=continue 56=default 2671=do 1173=double 2673=else 2677=enum
    2681=extends 2688=final 2693=finally 1153=float 2700=for 2703=goto 2707=if 2709=implements
    2719=import 2725=instanceof 1115=int 2735=interface 1135=long 2744=native 2750=new 2753=package
    2760=private 2767=protected 2776=public 2782=return 1072=short 2788=static 2794=strictfp 51=super
    2802=switch 2808=synchronized 47=this 2820=throw 2825=throws 2831=transient 2840=try 1219=void
    2843=volatile 2851=while
    2856=true 2860=false 2865=null
    2869=( 2870=) 2871={ 2872=} 2873=[ 2874=] 45=; 44=, 43=. 2875=...
    2878== 2598=> 2597=< 2574=! 2569=~ 2879=? 2880=:
    2603=== 2599=<= 2601=>= 2605=!= 2607=&& 2609=|| 2570=++ 2572=--
    2568=+ 1=- 46=* 0=/ 2587=& 2588=| 2589=^ 2586=% 2590=<< 2592=>> 2594=>>>
    2881=+= 2883=-= 2885=*= 3=/= 2887=&= 2889=|= 2891=^= 2893=%= 2895=<<= 2898=>>= 2901=>>>=
    2905=@
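To make the two-step construction easier to follow, here is a scaled-down sketch of the same idea: intern each token name to get an index, size the key array by the largest index, fill it with IDENTIFIER, then overwrite the keyword slots. Everything in it is a stand-in invented for this illustration: KeywordTableSketch, the three-keyword Tok enum, and the HashMap that plays the role of Name.Table. The "init" name is interned first only to show a non-keyword slot below maxKey.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Hypothetical, scaled-down model of the Keywords table; not javac's code.
final class KeywordTableSketch {
    enum Tok {
        IDENTIFIER(null), IF("if"), ELSE("else"), WHILE("while");
        final String name;
        Tok(String name) { this.name = name; }
    }

    // Stand-in for Name.Table: gives every distinct string a stable index.
    private final Map<String, Integer> indexOf = new HashMap<String, Integer>();
    private final Tok[] key;

    KeywordTableSketch() {
        intern("init");                        // a non-keyword name with a low index
        int maxKey = 0;
        for (Tok t : Tok.values()) {           // step 1: enter every named token
            if (t.name != null) {
                maxKey = Math.max(maxKey, intern(t.name));
            }
        }
        key = new Tok[maxKey + 1];
        Arrays.fill(key, Tok.IDENTIFIER);      // step 2: default every slot
        for (Tok t : Tok.values()) {           // step 3: overwrite keyword slots
            if (t.name != null) {
                key[intern(t.name)] = t;
            }
        }
    }

    private int intern(String s) {
        Integer idx = indexOf.get(s);
        if (idx == null) {
            idx = indexOf.size();
            indexOf.put(s, idx);
        }
        return idx;
    }

    /** Maps a name to its token; anything that is not a keyword is IDENTIFIER. */
    Tok key(String s) {
        int idx = intern(s);
        return idx < key.length ? key[idx] : Tok.IDENTIFIER;
    }
}
```

Looking up "if" yields Tok.IF, while both "init" (index below the maximum) and a fresh name such as "foo" (index above it) yield Tok.IDENTIFIER.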