org.comedia.util.scanner
Class CScanner

java.lang.Object
  |
  +--org.comedia.util.scanner.CScanner
Direct Known Subclasses:
CCppScanner, CPasScanner, CXmlScanner

public class CScanner
extends java.lang.Object

Base class for language-specific lexical scanners. It provides general scanning functionality and does not itself support language-specific keywords or data types.

Example of scanner usage:

 System.out.println("*********** Scanner Test *************");

 CScanner scanner = new CScanner();
 scanner.setBuffer("while(1.0e2*i := \t\r\n> \"string\'\'\")\n"
   + "// comment\n/*second\ncomment*/{xxx}");
 scanner.setShowEol(true);
 scanner.setShowSpace(true);

 // Testing string conversions
 String str = "The test \"string\"";
 System.out.println("Start string: " + str);
 str = scanner.wrapString(str);
 System.out.println("Wrapped string: " + str);
 str = scanner.unwrapString(str);
 System.out.println("Unwrapped string: " + str);

 System.out.println();
 System.out.println("Initial string: " + scanner.getBuffer());

 while (scanner.lex() != EOF) {
   switch (scanner.getTokenType()) {
     case UNKNOWN: System.out.print("Type: Unknown "); break;
     case COMMENT: System.out.print("Type: Comment "); break;
     case KEYWORD: System.out.print("Type: Keyword "); break;
     case TYPE: System.out.print("Type: Type "); break;
     case IDENT: System.out.print("Type: Ident "); break;
     case ALPHA: System.out.print("Type: Alpha "); break;
     case OPERATOR: System.out.print("Type: Operator "); break;
     case BRACE: System.out.print("Type: Brace "); break;
     case SEPARATOR: System.out.print("Type: Separator "); break;
     case EOL: System.out.print("Type: Eol "); break;
     case LF: System.out.print("Type: Lf "); break;
     case SPACE: System.out.print("Type: Space "); break;
     case INT: System.out.print("Type: Int "); break;
     case FLOAT: System.out.print("Type: Float "); break;
     case STRING: System.out.print("Type: String "); break;
     case BOOL: System.out.print("Type: Bool "); break;
     case EOF: System.out.print("Type: Eof "); break;
   }
   System.out.println("Value: '" + scanner.getToken()
     + "' Pos: " + scanner.getPosition() + " Line: " + scanner.getLineNo());
 }
 
The result:

 *********** Scanner Test *************
 Start string: The test "string"
 Wrapped string: "The test "string""
 Unwrapped string: The test "string"

 Initial string: while(1.0e2*i :=
 > "string''")
 // comment
 /*second
 comment*/{xxx}
 Type: Ident Value: 'while' Pos: 0 Line: 0
 Type: Brace Value: '(' Pos: 5 Line: 0
 Type: Float Value: '1.0e2' Pos: 6 Line: 0
 Type: Operator Value: '*' Pos: 11 Line: 0
 Type: Ident Value: 'i' Pos: 12 Line: 0
 Type: Space Value: ' ' Pos: 13 Line: 0
 Type: Separator Value: ':' Pos: 14 Line: 0
 Type: Operator Value: '=' Pos: 15 Line: 0
 Type: Space Value: ' 	' Pos: 16 Line: 0
 Type: Lf Value: '
 ' Pos: 18 Line: 0
 Type: Eol Value: '
 ' Pos: 19 Line: 0
 Type: Operator Value: '>' Pos: 20 Line: 1
 Type: Space Value: ' ' Pos: 21 Line: 1
 Type: String Value: '"string''"' Pos: 22 Line: 1
 Type: Brace Value: ')' Pos: 32 Line: 1
 Type: Eol Value: '
 ' Pos: 33 Line: 1
 Type: Operator Value: '/' Pos: 34 Line: 2
 Type: Operator Value: '/' Pos: 35 Line: 2
 Type: Space Value: ' ' Pos: 36 Line: 2
 Type: Ident Value: 'comment' Pos: 37 Line: 2
 Type: Eol Value: '
 ' Pos: 44 Line: 2
 Type: Operator Value: '/' Pos: 45 Line: 3
 Type: Operator Value: '*' Pos: 46 Line: 3
 Type: Ident Value: 'second' Pos: 47 Line: 3
 Type: Eol Value: '
 ' Pos: 53 Line: 3
 Type: Ident Value: 'comment' Pos: 54 Line: 4
 Type: Operator Value: '*' Pos: 61 Line: 4
 Type: Operator Value: '/' Pos: 62 Line: 4
 Type: Brace Value: '{' Pos: 63 Line: 4
 Type: Ident Value: 'xxx' Pos: 64 Line: 4
 Type: Brace Value: '}' Pos: 67 Line: 4
 


Inner Class Summary
protected  class CScanner.Lexem
          Represents an extracted token together with its type and position in the input stream.
 
Field Summary
static int ALPHA
          Constant which covers COMMENT, KEYWORD, TYPE or IDENT tokens.
static int BOOL
          Boolean constant token.
static int BRACE
          Brace token constant.
protected  java.lang.String buffer
          Buffer which contains the input stream.
protected  int bufferLen
          The length of the input stream.
protected  int bufferLine
          Line currently being processed in the input stream.
protected  int bufferPos
          Pointer to the current position in the input stream.
static int COMMENT
          Comment string token constant.
static int CONST
          Constant which covers all constant tokens: INT, FLOAT, STRING and BOOL.
protected  CScanner.Lexem current
          "Holder" class which contains the current extracted token.
static int DELIM
          Constant which covers OPERATOR, BRACE, SEPARATOR, EOL, LF and SPACE tokens.
static int EOF
          End-Of-File token constant.
static int EOL
          CHAR(13) token constant.
static int FLOAT
          Float constant token.
static int IDENT
          Identifier token constant.
static int INT
          Integer constant token.
static int KEYWORD
          Keyword token constant.
protected  java.lang.String[] keywords
          List of language-specific reserved keywords.
static int LF
          CHAR(10) token constant.
protected  CScanner.Lexem next
          "Holder" class which contains the next available token.
static int OPERATOR
          Operator token constant.
protected  java.lang.String[] operators
          List of language-specific operators.
static int SEPARATOR
          Lexeme separator token constant.
protected  boolean showComment
          Specifies whether comment tokens are shown or hidden.
protected  boolean showEol
          Specifies whether EOL/LF tokens are shown or hidden.
protected  boolean showKeyword
          Specifies whether keywords are recognized or presented as identifiers.
protected  boolean showSpace
          Specifies whether space tokens are shown or hidden.
protected  boolean showString
          Specifies how extracted string tokens are presented: in ordinary or escape format.
protected  boolean showType
          Specifies whether data type keywords are recognized or presented as identifiers.
static int SPACE
          Space token constant.
static int STRING
          String constant token.
static int TYPE
          Data type keyword token constant.
protected  java.lang.String[] types
          List of language-specific data type keywords.
static int UNKNOWN
          Unknown token constant.
 
Constructor Summary
CScanner()
          Default class constructor.
 
Method Summary
protected  void extractNextToken()
          Extracts "next" token from the input stream.
protected  void extractToken()
          Extracts the "current" token or copies it from the "next" token if one is available.
 java.lang.String getBuffer()
          Gets the input buffer string.
 int getBufferPos()
          Gets the current position in the input stream.
 int getLineNo()
          Gets the line number of the first character of the current token.
 int getNextLineNo()
          Gets the line number of the first character of the next token.
 int getNextPosition()
          Gets the position of the first character of the next token in the input stream.
 java.lang.String getNextToken()
          Gets the next token value.
 int getNextTokenType()
          Gets the next token type, represented by a token constant.
 int getPosition()
          Gets the position of the first character of the current token in the input stream.
 java.lang.String getToken()
          Gets the current token value.
 int getTokenType()
          Gets the current token type, represented by a token constant.
 int gotoNextToken()
          Continues the parsing process and extracts the current token.
protected  int innerProcCComment(CScanner.Lexem curr)
          Parses a C-like multi-line comment.
protected  int innerProcCString(CScanner.Lexem curr)
          Parses a C-like escaped string.
protected  int innerProcIdent(CScanner.Lexem curr)
          Parses an identifier or numeric constant token.
protected  int innerProcLineComment(CScanner.Lexem curr)
          Processes the rest of a single-line comment.
protected  int innerProcPasString(CScanner.Lexem curr)
          Parses a Pascal-like escaped string.
protected  int innerProcString(CScanner.Lexem curr)
          Parses a string.
protected  int innerStartLex(CScanner.Lexem curr)
          Starts the first stage of lexical parsing.
static boolean isAlpha(char c)
          Checks whether a character is an alpha.
static boolean isDelim(char c)
          Checks whether a character is a delimiter.
static boolean isDigit(char c)
          Checks whether a character is a digit.
static boolean isEol(char c)
          Checks whether a character is an EOL (CHAR(13)) symbol.
static boolean isLetter(char c)
          Checks whether a character is a letter.
static boolean isQuote(char c)
          Checks whether a character is a quote.
 boolean isShowComment()
          Gets the ShowComment property value.
 boolean isShowEol()
          Gets the ShowEol property value.
 boolean isShowKeyword()
          Gets the ShowKeyword property value.
 boolean isShowSpace()
          Gets the ShowSpace property value.
 boolean isShowString()
          Gets the ShowString property value.
 boolean isShowType()
          Gets the ShowType property value.
static boolean isWhite(char c)
          Checks whether a character is white space.
 int lex()
          Starts the parsing process and extracts the current token.
protected  int lowRunLex(CScanner.Lexem curr)
          Gets a low-level token.
static void main(java.lang.String[] args)
          The main function for test purposes.
 void restart()
          Restarts the parsing process by reassigning the same input buffer.
protected  int runLex(CScanner.Lexem curr)
          Extracts the next available token from the input stream.
protected  boolean searchForString(java.lang.String s, java.lang.String[] a)
          Searches for a string value in a string array.
 void setBuffer(java.lang.String s)
          Sets a new input buffer and resets buffer pointers.
 void setShowComment(boolean value)
          Sets a new ShowComment property value.
 void setShowEol(boolean value)
          Sets a new ShowEol property value.
 void setShowKeyword(boolean value)
          Sets a new ShowKeyword property value.
 void setShowSpace(boolean value)
          Sets a new ShowSpace property value.
 void setShowString(boolean value)
          Sets a new ShowString property value.
 void setShowType(boolean value)
          Sets a new ShowType property value.
static java.lang.String unwrapString(java.lang.String s)
          Converts a string from the quote-delimited escape format into its ordinary (local) representation.
static java.lang.String wrapString(java.lang.String s)
          Converts a string from its ordinary representation into the quote-delimited escape format.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

UNKNOWN

public static final int UNKNOWN
Unknown token constant.

COMMENT

public static final int COMMENT
Comment string token constant.

KEYWORD

public static final int KEYWORD
Keyword token constant. It depends on supported language syntax.

TYPE

public static final int TYPE
Data type keyword token constant. It depends on supported language syntax.

IDENT

public static final int IDENT
Identifier token constant. The rules depend on the supported language syntax.

ALPHA

public static final int ALPHA
Constant which covers COMMENT, KEYWORD, TYPE or IDENT tokens.

OPERATOR

public static final int OPERATOR
Operator token constant. It depends on the supported language syntax. An operator can be simple ('-', '=' or '%'), compound ('>=', '<=') or represented by a keyword ('or', 'and').

BRACE

public static final int BRACE
Brace token constant. Braces can be square '[]', round '()' or curly '{}'.

SEPARATOR

public static final int SEPARATOR
Lexeme separator token constant. Separators are represented by the ',', '.', ':' or ';' characters, depending on the supported language syntax.

EOL

public static final int EOL
CHAR(13) token constant.

LF

public static final int LF
CHAR(10) token constant.

SPACE

public static final int SPACE
Space token constant. The token can consist of spaces and tab characters.

DELIM

public static final int DELIM
Constant which covers OPERATOR, BRACE, SEPARATOR, EOL, LF and SPACE tokens.

INT

public static final int INT
Integer constant token.

FLOAT

public static final int FLOAT
Float constant token.

STRING

public static final int STRING
String constant token. Escape symbols depend on supported language syntax.

BOOL

public static final int BOOL
Boolean constant token.

CONST

public static final int CONST
Constant which covers all constant tokens: INT, FLOAT, STRING and BOOL.

EOF

public static final int EOF
End-Of-File token constant. It marks the end of the input stream and of the parsing process.

buffer

protected java.lang.String buffer
Buffer which contains the input stream.

bufferPos

protected int bufferPos
Pointer to the current position in the input stream.

bufferLine

protected int bufferLine
Line currently being processed in the input stream.

bufferLen

protected int bufferLen
The length of the input stream.

current

protected CScanner.Lexem current
"Holder" class which contains the current extracted token.

next

protected CScanner.Lexem next
"Holder" class which contains the next available token.

showComment

protected boolean showComment
Specifies whether comment tokens are shown or hidden. It is FALSE by default.

showString

protected boolean showString
Specifies how extracted string tokens are presented: in ordinary or escape format. TRUE means the string is presented in escape format. It is TRUE by default.

showEol

protected boolean showEol
Specifies whether EOL/LF tokens are shown or hidden. It is FALSE by default.

showKeyword

protected boolean showKeyword
Specifies whether keywords are recognized or presented as identifiers. It is TRUE by default.

showType

protected boolean showType
Specifies whether data type keywords are recognized or presented as identifiers. It is TRUE by default.

showSpace

protected boolean showSpace
Specifies whether space tokens are shown or hidden. It is FALSE by default.

operators

protected java.lang.String[] operators
List of language-specific operators.

types

protected java.lang.String[] types
List of language-specific data type keywords.

keywords

protected java.lang.String[] keywords
List of language-specific reserved keywords.
Constructor Detail

CScanner

public CScanner()
Default class constructor.
Method Detail

lowRunLex

protected int lowRunLex(CScanner.Lexem curr)
Gets a low-level token. Implements the main parsing process and should be overridden by language-specific scanners.
Parameters:
curr - a "Holder" which contains the extracted token.

runLex

protected int runLex(CScanner.Lexem curr)
Extracts the next available token from the input stream. It can extract the "current" or the "next" token, depending on the implementation logic.
Parameters:
curr - a "Holder" class which contains the extracted token.

extractToken

protected void extractToken()
Extracts the "current" token or copies it from the "next" token if one is available.

extractNextToken

protected void extractNextToken()
Extracts "next" token from the input stream.

innerStartLex

protected int innerStartLex(CScanner.Lexem curr)
Starts the first stage of lexical parsing. It initializes a token and skips white space from the current position of the input stream. Each parsing process should begin with this method.
Parameters:
curr - a "Holder" class which receives the token being extracted.

innerProcLineComment

protected int innerProcLineComment(CScanner.Lexem curr)
Processes the rest of a single-line comment. The method can be used in different lexical procedures to skip the rest of the line.
Parameters:
curr - a "Holder" class which receives the token being extracted.

innerProcCComment

protected int innerProcCComment(CScanner.Lexem curr)
Parses a C-like multi-line comment. The comment starts with '/*' and ends with '*/'.
Parameters:
curr - a "Holder" class which receives the token being extracted.
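
As a rough illustration of the comment rule, the following sketch locates the end of a multi-line comment in a buffer. It assumes the conventional C delimiters '/*' and '*/'. The class CCommentSketch and the method skipComment are hypothetical names used only for this example and are not part of CScanner.

```java
// Hypothetical sketch; not the CScanner implementation.
final class CCommentSketch {
    // Returns the index just past the closing "*/" of a comment that opens
    // at pos, or buf.length() if the comment is never closed.
    // Assumes the two characters at pos are the opening "/*".
    static int skipComment(String buf, int pos) {
        // Start searching after the opener so that "/*/" is not treated as closed.
        int close = buf.indexOf("*/", pos + 2);
        return close < 0 ? buf.length() : close + 2;
    }

    public static void main(String[] args) {
        String buf = "/*second\ncomment*/{xxx}";
        int end = skipComment(buf, 0);
        // Print whatever follows the comment.
        System.out.println(buf.substring(end));
    }
}
```

Treating an unterminated comment as running to the end of the buffer is one possible design choice; a real scanner might report an error instead.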

innerProcIdent

protected int innerProcIdent(CScanner.Lexem curr)
Parses an identifier or numeric constant token. An identifier starts with an alpha and can contain alphas, digits or special characters like '_$'. A numeric constant starts with a digit and can contain alphas as well as digits. Float constants can also contain a period ('.').
Parameters:
curr - a "Holder" class which receives the token being extracted.
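
The rules above can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the library's code: the class IdentSketch, the method scan and the numeric values of the type constants are hypothetical, letters are limited to the Latin alphabet, and signed exponents such as '1.0e-2' are not handled.

```java
// Hypothetical sketch; not the CScanner implementation.
final class IdentSketch {
    // Illustrative token type values; the real constants may differ.
    static final int IDENT = 0, INT = 1, FLOAT = 2;

    static boolean isLetter(char c) {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
    }

    static boolean isDigit(char c) {
        return c >= '0' && c <= '9';
    }

    // Scans one identifier or numeric constant starting at pos.
    // Returns { tokenType, endIndexExclusive }.
    // Assumes buf.charAt(pos) is a letter, '_', '$' or a digit.
    static int[] scan(String buf, int pos) {
        boolean numeric = isDigit(buf.charAt(pos));
        int type = numeric ? INT : IDENT;
        int i = pos;
        while (i < buf.length()) {
            char c = buf.charAt(i);
            if (isLetter(c) || isDigit(c) || c == '_' || c == '$') {
                i++;                       // ordinary token character
            } else if (numeric && type == INT && c == '.') {
                type = FLOAT;              // first period turns INT into FLOAT
                i++;
            } else {
                break;                     // delimiter or white space ends the token
            }
        }
        return new int[] { type, i };
    }

    public static void main(String[] args) {
        String buf = "while(1.0e2*i";
        int[] ident = scan(buf, 0);
        int[] number = scan(buf, 6);
        System.out.println(buf.substring(0, ident[1]) + " type=" + ident[0]);
        System.out.println(buf.substring(6, number[1]) + " type=" + number[0]);
    }
}
```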

innerProcString

protected int innerProcString(CScanner.Lexem curr)
Parses a string. The string must be delimited by single or double quotes.
Parameters:
curr - a "Holder" class which receives the token being extracted.

innerProcCString

protected int innerProcCString(CScanner.Lexem curr)
Parses a C-like escaped string. The string must be delimited by double quotes ('"') and contain special characters in C-like escape format (CHAR(13) -> '\n', '"' -> '\"', etc).
Parameters:
curr - a "Holder" class which receives the token being extracted.
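
To make the C-like escape format concrete, here is a minimal sketch of wrapping and unwrapping a string. It is only an illustration: the class CStringSketch and its methods are hypothetical, and the mapping follows the standard Java and C convention (line feed as '\n', carriage return as '\r'), which may differ from the mapping this library uses.

```java
// Hypothetical sketch; not the CScanner implementation.
final class CStringSketch {
    // Wraps s in double quotes and escapes quotes, backslashes and
    // a few control characters.
    static String wrap(String s) {
        StringBuilder sb = new StringBuilder("\"");
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '"':  sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\n': sb.append("\\n");  break;
                case '\r': sb.append("\\r");  break;
                case '\t': sb.append("\\t");  break;
                default:   sb.append(c);
            }
        }
        return sb.append('"').toString();
    }

    // Reverses wrap. Input that is not enclosed in double quotes
    // is returned unchanged.
    static String unwrap(String s) {
        int end = s.length() - 1;
        if (end < 1 || s.charAt(0) != '"' || s.charAt(end) != '"') {
            return s;
        }
        StringBuilder sb = new StringBuilder();
        for (int i = 1; i < end; i++) {
            char c = s.charAt(i);
            if (c == '\\' && i + 1 < end) {
                char e = s.charAt(++i);    // character after the backslash
                switch (e) {
                    case 'n': sb.append('\n'); break;
                    case 'r': sb.append('\r'); break;
                    case 't': sb.append('\t'); break;
                    default:  sb.append(e);    // covers \" and \\
                }
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String wrapped = wrap("The test \"string\"");
        System.out.println(wrapped);
        System.out.println(unwrap(wrapped));
    }
}
```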

innerProcPasString

protected int innerProcPasString(CScanner.Lexem curr)
Parses a Pascal-like escaped string. The string must be delimited by single quotes (''') and every single quote inside it must be doubled.
Parameters:
curr - a "Holder" class which receives the token being extracted.
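
The quote-doubling rule can be illustrated with a small sketch. The class PasStringSketch and its methods are hypothetical names chosen for this example; they show the general Pascal convention and are not taken from the library.

```java
// Hypothetical sketch; not the CScanner implementation.
final class PasStringSketch {
    // Wraps s in single quotes and doubles every embedded single quote.
    static String wrap(String s) {
        StringBuilder sb = new StringBuilder("'");
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '\'') {
                sb.append('\'');           // extra quote before the original one
            }
            sb.append(c);
        }
        return sb.append('\'').toString();
    }

    // Reverses wrap. Input that is not enclosed in single quotes
    // is returned unchanged.
    static String unwrap(String s) {
        int end = s.length() - 1;
        if (end < 1 || s.charAt(0) != '\'' || s.charAt(end) != '\'') {
            return s;
        }
        StringBuilder sb = new StringBuilder();
        for (int i = 1; i < end; i++) {
            char c = s.charAt(i);
            sb.append(c);
            // Collapse a doubled quote into a single one.
            if (c == '\'' && i + 1 < end && s.charAt(i + 1) == '\'') {
                i++;
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String wrapped = wrap("it's");
        System.out.println(wrapped);
        System.out.println(unwrap(wrapped));
    }
}
```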

searchForString

protected boolean searchForString(java.lang.String s,
                                  java.lang.String[] a)
Searches for a string value in a string array. This method is used to look up registered keywords, operators or data types.
Parameters:
s - the string value to search for.
a - the string array to search in.
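
A lookup of this kind could be sketched as a plain linear search. The sketch below is an assumption about the general approach, not the actual implementation; in particular, the case-sensitive comparison, the null handling and the sample keyword list are illustrative choices.

```java
// Hypothetical sketch; not the CScanner implementation.
final class SearchSketch {
    // Returns true if s occurs in a. The comparison is case-sensitive,
    // which is an assumption of this sketch.
    static boolean searchForString(String s, String[] a) {
        if (s == null || a == null) {
            return false;
        }
        for (String item : a) {
            if (s.equals(item)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Illustrative keyword list; real lists depend on the language scanner.
        String[] keywords = { "while", "if", "else" };
        System.out.println(searchForString("while", keywords));
        System.out.println(searchForString("While", keywords));
    }
}
```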

restart

public void restart()
Restarts the parsing process by reassigning the same input buffer.

wrapString

public static java.lang.String wrapString(java.lang.String s)
Converts a string from its ordinary representation into the quote-delimited escape format. The escape format depends on the supported language syntax.
Parameters:
s - a string in ordinary (local) representation.

unwrapString

public static java.lang.String unwrapString(java.lang.String s)
Converts a string from the quote-delimited escape format into its ordinary (local) representation. The escape format depends on the supported language syntax.
Parameters:
s - a string in the escape format.

isAlpha

public static boolean isAlpha(char c)
Checks whether a character is an alpha. An alpha is a character that is not white space, a delimiter or a digit.
Parameters:
c - the character to check.

isLetter

public static boolean isLetter(char c)
Checks whether a character is a letter. Only letters of the Latin alphabet are accepted; letters from other alphabets are not.
Parameters:
c - the character to check.
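
A Latin-only letter check differs from the JDK's Unicode-aware Character.isLetter. The following sketch shows the difference; the class LetterSketch and the method isLatinLetter are hypothetical, and the assumption that only the ASCII ranges a-z and A-Z count as letters is mine rather than something confirmed for this library.

```java
// Hypothetical sketch; not the CScanner implementation.
final class LetterSketch {
    // Accepts only the ASCII Latin letters a-z and A-Z.
    static boolean isLatinLetter(char c) {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
    }

    public static void main(String[] args) {
        char accented = '\u00e9'; // e with acute accent
        System.out.println(isLatinLetter('q'));
        System.out.println(isLatinLetter(accented));
        // The JDK check is Unicode-aware and accepts the accented letter.
        System.out.println(Character.isLetter(accented));
    }
}
```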

isDigit

public static boolean isDigit(char c)
Checks whether a character is a digit.
Parameters:
c - the character to check.

isDelim

public static boolean isDelim(char c)
Checks whether a character is a delimiter.
Parameters:
c - the character to check.

isWhite

public static boolean isWhite(char c)
Checks whether a character is white space.
Parameters:
c - the character to check.

isEol

public static boolean isEol(char c)
Checks whether a character is an EOL (CHAR(13)) symbol.
Parameters:
c - the character to check.

isQuote

public static boolean isQuote(char c)
Checks whether a character is a quote. Quotes are represented by the ''' or '"' characters.
Parameters:
c - the character to check.

lex

public int lex()
Starts the parsing process and extracts the current token.

gotoNextToken

public int gotoNextToken()
Continues the parsing process and extracts the current token. It does the same as the lex method.

isShowComment

public boolean isShowComment()
Gets the ShowComment property value. ShowComment specifies whether comment tokens are shown or hidden. It is FALSE by default.

setShowComment

public void setShowComment(boolean value)
Sets a new ShowComment property value. ShowComment specifies whether comment tokens are shown or hidden. It is FALSE by default.
Parameters:
value - a new ShowComment property value.

isShowEol

public boolean isShowEol()
Gets the ShowEol property value. ShowEol specifies whether EOL/LF tokens are shown or hidden. It is FALSE by default.

setShowEol

public void setShowEol(boolean value)
Sets a new ShowEol property value. ShowEol specifies whether EOL/LF tokens are shown or hidden. It is FALSE by default.
Parameters:
value - a new ShowEol property value.

isShowString

public boolean isShowString()
Gets the ShowString property value. ShowString specifies how extracted string tokens are presented: in ordinary or escape format. TRUE means the string is presented in escape format. It is TRUE by default.

setShowString

public void setShowString(boolean value)
Sets a new ShowString property value. ShowString specifies how extracted string tokens are presented: in ordinary or escape format. TRUE means the string is presented in escape format. It is TRUE by default.
Parameters:
value - a new ShowString property value.

isShowKeyword

public boolean isShowKeyword()
Gets the ShowKeyword property value. ShowKeyword specifies whether keywords are recognized or presented as identifiers. It is TRUE by default.

setShowKeyword

public void setShowKeyword(boolean value)
Sets a new ShowKeyword property value. ShowKeyword specifies whether keywords are recognized or presented as identifiers. It is TRUE by default.
Parameters:
value - a new ShowKeyword property value.

isShowType

public boolean isShowType()
Gets the ShowType property value. ShowType specifies whether data type keywords are recognized or presented as identifiers. It is TRUE by default.

setShowType

public void setShowType(boolean value)
Sets a new ShowType property value. ShowType specifies whether data type keywords are recognized or presented as identifiers. It is TRUE by default.
Parameters:
value - a new ShowType property value.

isShowSpace

public boolean isShowSpace()
Gets the ShowSpace property value. ShowSpace specifies whether space tokens are shown or hidden. It is FALSE by default.

setShowSpace

public void setShowSpace(boolean value)
Sets a new ShowSpace property value. ShowSpace specifies whether space tokens are shown or hidden. It is FALSE by default.
Parameters:
value - a new ShowSpace property value.

getBuffer

public java.lang.String getBuffer()
Gets the input buffer string.

setBuffer

public void setBuffer(java.lang.String s)
Sets a new input buffer and resets buffer pointers.
Parameters:
s - a new input stream.

getBufferPos

public int getBufferPos()
Gets the current position in the input stream.

getPosition

public int getPosition()
Gets the position of the first character of the current token in the input stream.

getLineNo

public int getLineNo()
Gets the line number of the first character of the current token.

getToken

public java.lang.String getToken()
Gets the current token value.

getTokenType

public int getTokenType()
Gets the current token type, represented by a token constant.

getNextPosition

public int getNextPosition()
Gets the position of the first character of the next token in the input stream.

getNextLineNo

public int getNextLineNo()
Gets the line number of the first character of the next token.

getNextToken

public java.lang.String getNextToken()
Gets the next token value.

getNextTokenType

public int getNextTokenType()
Gets the next token type, represented by a token constant.

main

public static void main(java.lang.String[] args)
The main function for test purposes.