Tokenizing text with a StreamTokenizer

Instantiate a StreamTokenizer, pass it a Reader instance, and loop through the available tokens with nextToken. This method returns an int that identifies the type of the token that was read; the same value is also stored in the ttype field. These are the possibilities:

   TT_EOF		end of file
   TT_EOL		end of line
   TT_NUMBER	numeric value, the actual value is stored in nval
   TT_WORD		word value, the actual value is stored in sval
   " or '		a quoted string; the token type is the quote character and the string body is stored in sval
   x			any other character; the token type is the character itself, as an int

This simple example shows you how to read in a text file and print out its tokens.

Main.java:

import java.io.*;
 
public class Main
{
   public static void main(String[] args) throws IOException {
      if (args.length != 1) {
         System.err.println("Usage: java Main <file>");
         System.exit(1);
      }
 
      // try-with-resources closes the reader once all tokens have been read
      try (BufferedReader br = new BufferedReader(new FileReader(args[0]))) {
         StreamTokenizer st = new StreamTokenizer(br);
 
         // nextToken returns the token type; the value ends up in nval or sval
         int t = st.nextToken();
         while (t != StreamTokenizer.TT_EOF) {
            switch (t) {
               case StreamTokenizer.TT_EOL:
                  // only returned after eolIsSignificant(true)
                  System.out.println("TT_EOL");
                  break;
               case StreamTokenizer.TT_NUMBER:
                  System.out.println("TT_NUMBER: " + st.nval);
                  break;
               case StreamTokenizer.TT_WORD:
                  System.out.println("TT_WORD: " + st.sval);
                  break;
               case '"':
                  System.out.println("quoted string: " + st.sval);
                  break;
               default:
                  // any other character is returned as its own token
                  System.out.println("tokentype: " + (char) t);
            }
 
            t = st.nextToken();
         }
      }
   }
}

If we run it on the following text file:

/* 
 * simple program in Java
 */
 
public class Main {
   public static void Main(String []args) {
      // make calculation
      int a = 4 / 2;
 
      System.out.println("result: " + a);
   }
}

it produces the following result.

tokentype: *
TT_WORD: simple
TT_WORD: program
TT_WORD: in
TT_WORD: Java
tokentype: *
TT_WORD: public
TT_WORD: class
TT_WORD: Main
tokentype: {
TT_WORD: public
TT_WORD: static
TT_WORD: void
TT_WORD: Main
tokentype: (
TT_WORD: String
tokentype: [
tokentype: ]
TT_WORD: args
tokentype: )
tokentype: {
TT_WORD: int
TT_WORD: a
tokentype: =
TT_NUMBER: 4.0
TT_WORD: System.out.println
tokentype: (
quoted string: result: 
tokentype: +
TT_WORD: a
tokentype: )
tokentype: ;
tokentype: }
tokentype: }

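Notice that /*, /, // and whitespace are missing from the output, and so is everything that follows a / on the same line (the 2; after 4 / never shows up). System.out.println also comes back as a single word, because the period counts as a number character and number characters may continue a word. The reason for all this is that StreamTokenizer starts with an initial setup: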

   - 'A' to 'Z', 'a' to 'z' and \u00A0 through \u00FF
	are considered word characters
   - \u0000 through \u0020 are considered whitespace
   - / is a comment character
   - ' and " are considered quote characters
   - Numbers are parsed (notice 4 has become 4.0)
   - EOL is considered whitespace
   - C-style (/* */) and C++-style (//) comments are not recognized
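
The defaults listed above can be observed without a file by feeding the tokenizer a StringReader. The following is only a minimal sketch: the class DefaultsDemo and its tokenize helper are hypothetical names made up for this illustration, and the "WORD"/"NUMBER"/"CHAR"/"QUOTED" labels are just a convenient way to render each token.

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

class DefaultsDemo {
   // Hypothetical helper: tokenizes text with an unconfigured StreamTokenizer
   // and returns one short description per token.
   static List<String> tokenize(String text) {
      StreamTokenizer st = new StreamTokenizer(new StringReader(text));
      List<String> tokens = new ArrayList<>();
      try {
         while (st.nextToken() != StreamTokenizer.TT_EOF) {
            switch (st.ttype) {
               case StreamTokenizer.TT_NUMBER:
                  tokens.add("NUMBER " + st.nval);
                  break;
               case StreamTokenizer.TT_WORD:
                  tokens.add("WORD " + st.sval);
                  break;
               case '"':
               case '\'':
                  // both quote characters are active by default
                  tokens.add("QUOTED " + st.sval);
                  break;
               default:
                  tokens.add("CHAR " + (char) st.ttype);
            }
         }
      } catch (IOException e) {
         throw new UncheckedIOException(e);
      }
      return tokens;
   }

   public static void main(String[] args) {
      // The / starts a comment, so "/ 2;" is dropped
      System.out.println(tokenize("int a = 4 / 2;"));
      // The period is a number character, so the dotted name stays one word
      System.out.println(tokenize("System.out.println('x')"));
   }
}
```

The first call should yield the tokens int, a, = and 4.0 only, which matches the truncated line in the output above.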

You can customize the StreamTokenizer in a number of ways:

1. wordChars(int lo, int hi)

The lo and hi parameters specify the Unicode range of characters that you would like to see treated as part of a word. You can call this method several times to include several ranges. Try this after you have instantiated the StreamTokenizer:

      // consider all values in the range '{' to '}' as word characters
      st.wordChars('{', '}');
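
A common use of wordChars is to allow underscores inside identifiers. This is a small sketch under that assumption; WordCharsDemo and its tokenize helper are hypothetical names, not part of the original example.

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

class WordCharsDemo {
   // Hypothetical helper: optionally registers '_' as a word character
   // before tokenizing.
   static List<String> tokenize(String text, boolean underscoreIsWordChar) {
      StreamTokenizer st = new StreamTokenizer(new StringReader(text));
      if (underscoreIsWordChar) {
         st.wordChars('_', '_');   // a range of exactly one character
      }
      List<String> tokens = new ArrayList<>();
      try {
         while (st.nextToken() != StreamTokenizer.TT_EOF) {
            if (st.ttype == StreamTokenizer.TT_WORD) {
               tokens.add("WORD " + st.sval);
            } else if (st.ttype == StreamTokenizer.TT_NUMBER) {
               tokens.add("NUMBER " + st.nval);
            } else {
               tokens.add("CHAR " + (char) st.ttype);
            }
         }
      } catch (IOException e) {
         throw new UncheckedIOException(e);
      }
      return tokens;
   }

   public static void main(String[] args) {
      System.out.println(tokenize("max_value = 10", false)); // '_' splits the name
      System.out.println(tokenize("max_value = 10", true));  // max_value is one word
   }
}
```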

2. whitespaceChars(int lo, int hi)

The lo and hi parameters specify the Unicode range of characters that you would like to see treated as whitespace, that is, as token separators that are never returned themselves. You can call this method several times to include several ranges. Try this:

      // consider all values in the range '{' and '}' as whitespace
      st.whitespaceChars('{', '}'); 
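
Treating a delimiter as whitespace is a quick way to split simple comma-separated input. The sketch below assumes that use case; WhitespaceCharsDemo and tokenize are hypothetical names chosen for the illustration.

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

class WhitespaceCharsDemo {
   // Hypothetical helper: optionally treats ',' as whitespace.
   static List<String> tokenize(String text, boolean commaIsWhitespace) {
      StreamTokenizer st = new StreamTokenizer(new StringReader(text));
      if (commaIsWhitespace) {
         st.whitespaceChars(',', ',');
      }
      List<String> tokens = new ArrayList<>();
      try {
         while (st.nextToken() != StreamTokenizer.TT_EOF) {
            if (st.ttype == StreamTokenizer.TT_WORD) {
               tokens.add("WORD " + st.sval);
            } else {
               tokens.add("CHAR " + (char) st.ttype);
            }
         }
      } catch (IOException e) {
         throw new UncheckedIOException(e);
      }
      return tokens;
   }

   public static void main(String[] args) {
      System.out.println(tokenize("red,green,,blue", false)); // commas are tokens
      System.out.println(tokenize("red,green,,blue", true));  // commas vanish
   }
}
```

Note that consecutive delimiters collapse, so empty fields are lost; this approach is not a substitute for a real CSV parser.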

3. ordinaryChars(int lo, int hi)

The lo and hi parameters specify the Unicode range of characters that you would like to see treated as ordinary characters, meaning they are not part of a word, number, whitespace, etc. Each one is returned by nextToken as a single-character token. There’s a variation on this method, ordinaryChar(int ch), that takes only one character. Try this:

      // consider all values in the range 'a' to 'g' as ordinary char
      st.ordinaryChars('a', 'g');
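
The effect of both variants can be sketched on a short string. OrdinaryCharsDemo and tokenize are hypothetical names; the sketch applies the range from the line above and additionally uses the single-character variant to strip the default comment meaning from /.

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

class OrdinaryCharsDemo {
   // Hypothetical helper: 'a'..'g' and '/' become ordinary characters.
   static List<String> tokenize(String text) {
      StreamTokenizer st = new StreamTokenizer(new StringReader(text));
      st.ordinaryChars('a', 'g');  // these letters no longer build words
      st.ordinaryChar('/');        // '/' no longer starts a comment
      List<String> tokens = new ArrayList<>();
      try {
         while (st.nextToken() != StreamTokenizer.TT_EOF) {
            if (st.ttype == StreamTokenizer.TT_WORD) {
               tokens.add("WORD " + st.sval);
            } else {
               tokens.add("CHAR " + (char) st.ttype);
            }
         }
      } catch (IOException e) {
         throw new UncheckedIOException(e);
      }
      return tokens;
   }

   public static void main(String[] args) {
      // "cab" falls apart into c, a, b; of "fare" only "re" remains a word
      System.out.println(tokenize("cab / fare"));
   }
}
```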

4. commentChar(int ch)

Specifies that the character ch starts a single-line comment, meaning the character plus the rest of the line is ignored. Try this:

      // treat 'p' as a comment character: it and the rest of the line are skipped
      st.commentChar('p');
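
A more realistic choice than 'p' is '#', as used in many configuration formats. This is a minimal sketch assuming such input; CommentCharDemo and tokenize are hypothetical names.

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

class CommentCharDemo {
   // Hypothetical helper: '#' starts a comment that runs to the end of the line.
   static List<String> tokenize(String text) {
      StreamTokenizer st = new StreamTokenizer(new StringReader(text));
      st.commentChar('#');
      List<String> tokens = new ArrayList<>();
      try {
         while (st.nextToken() != StreamTokenizer.TT_EOF) {
            if (st.ttype == StreamTokenizer.TT_WORD) {
               tokens.add("WORD " + st.sval);
            } else {
               tokens.add("CHAR " + (char) st.ttype);
            }
         }
      } catch (IOException e) {
         throw new UncheckedIOException(e);
      }
      return tokens;
   }

   public static void main(String[] args) {
      // Only the text after '#' on the first line is skipped
      System.out.println(tokenize("name = value # trailing note\nnext"));
   }
}
```

The default comment character / stays active as well; commentChar adds a comment character rather than replacing the existing one.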

5. quoteChar(int ch)

Tells the tokenizer that ch delimits string constants: everything between a matching pair of ch characters is returned as one token, with the token type set to ch and the string body stored in sval. Try this:

      st.quoteChar('/');
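
Because the token type of a quoted string is the quote character itself, the calling code has to compare ttype against that character. The sketch below uses '|' as an arbitrary delimiter; QuoteCharDemo and tokenize are hypothetical names.

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

class QuoteCharDemo {
   // Hypothetical helper: text between two '|' characters is one string token.
   static List<String> tokenize(String text) {
      StreamTokenizer st = new StreamTokenizer(new StringReader(text));
      st.quoteChar('|');
      List<String> tokens = new ArrayList<>();
      try {
         while (st.nextToken() != StreamTokenizer.TT_EOF) {
            if (st.ttype == '|') {
               // sval holds the body without the delimiters
               tokens.add("QUOTED " + st.sval);
            } else if (st.ttype == StreamTokenizer.TT_WORD) {
               tokens.add("WORD " + st.sval);
            } else {
               tokens.add("CHAR " + (char) st.ttype);
            }
         }
      } catch (IOException e) {
         throw new UncheckedIOException(e);
      }
      return tokens;
   }

   public static void main(String[] args) {
      System.out.println(tokenize("say |hello world| now"));
   }
}
```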

6. parseNumbers()

This tells the tokenizer that the characters 0 to 9, the period and the minus sign should be recognized as part of a TT_NUMBER token whenever a number can be constructed from them. The value is stored in nval as a double. By default, parseNumbers is set. To have digits, . or - treated differently, first reset them with ordinaryChar or ordinaryChars and then, if they should be part of words, add them back with wordChars.
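
One way to switch number parsing off for digits, the period and the minus sign is sketched here, so that something like a date stays a single word. ParseNumbersDemo and tokenize are hypothetical names, and the input strings are made up for the illustration.

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

class ParseNumbersDemo {
   // Hypothetical helper: optionally turns number characters into word characters.
   static List<String> tokenize(String text, boolean numbersAsWords) {
      StreamTokenizer st = new StreamTokenizer(new StringReader(text));
      if (numbersAsWords) {
         // ordinaryChar(s) removes the numeric meaning ...
         st.ordinaryChars('0', '9');
         st.ordinaryChar('.');
         st.ordinaryChar('-');
         // ... and wordChars then marks the same characters as word constituents
         st.wordChars('0', '9');
         st.wordChars('.', '.');
         st.wordChars('-', '-');
      }
      List<String> tokens = new ArrayList<>();
      try {
         while (st.nextToken() != StreamTokenizer.TT_EOF) {
            if (st.ttype == StreamTokenizer.TT_WORD) {
               tokens.add("WORD " + st.sval);
            } else if (st.ttype == StreamTokenizer.TT_NUMBER) {
               tokens.add("NUMBER " + st.nval);
            } else {
               tokens.add("CHAR " + (char) st.ttype);
            }
         }
      } catch (IOException e) {
         throw new UncheckedIOException(e);
      }
      return tokens;
   }

   public static void main(String[] args) {
      System.out.println(tokenize("room 101", false));           // 101 becomes 101.0
      System.out.println(tokenize("released 2024-01-15", true)); // date stays one word
   }
}
```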

7. eolIsSignificant(boolean b)

If b is true, TT_EOL is returned whenever an end of line is encountered. Otherwise, line ends are treated as whitespace. Try this:

      st.eolIsSignificant(true);
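
With TT_EOL available, line-oriented processing becomes possible. As a rough sketch, the hypothetical helper tokensPerLine below (in a hypothetical class EolDemo) counts the tokens on each line.

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

class EolDemo {
   // Hypothetical helper: returns the number of tokens found on each line.
   static List<Integer> tokensPerLine(String text) {
      StreamTokenizer st = new StreamTokenizer(new StringReader(text));
      st.eolIsSignificant(true);
      List<Integer> counts = new ArrayList<>();
      int current = 0;
      try {
         while (st.nextToken() != StreamTokenizer.TT_EOF) {
            if (st.ttype == StreamTokenizer.TT_EOL) {
               counts.add(current);   // line finished, store its count
               current = 0;
            } else {
               current++;
            }
         }
      } catch (IOException e) {
         throw new UncheckedIOException(e);
      }
      if (current > 0) {
         counts.add(current);         // last line without a trailing newline
      }
      return counts;
   }

   public static void main(String[] args) {
      System.out.println(tokensPerLine("a b c\nd e\n\nf")); // prints [3, 2, 0, 1]
   }
}
```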

8. slashStarComments(boolean b)

If b is true, all characters between /* and */ are ignored (C-style comments).

9. slashSlashComments(boolean b)

If b is true, // is recognized as the start of a comment and the rest of the line is ignored (C++-style comments).
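
Both comment flags can be combined with ordinaryChar('/') so that real comments are skipped while a lone / survives as a division operator. This is a sketch of that combination, not the only possible setup; SlashCommentsDemo and tokenize are hypothetical names.

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

class SlashCommentsDemo {
   // Hypothetical helper: tokenizes Java-like source, skipping both comment styles.
   static List<String> tokenize(String text) {
      StreamTokenizer st = new StreamTokenizer(new StringReader(text));
      st.ordinaryChar('/');          // a single '/' is no longer a comment
      st.slashStarComments(true);    // skip /* ... */
      st.slashSlashComments(true);   // skip // ... to end of line
      List<String> tokens = new ArrayList<>();
      try {
         while (st.nextToken() != StreamTokenizer.TT_EOF) {
            if (st.ttype == StreamTokenizer.TT_WORD) {
               tokens.add("WORD " + st.sval);
            } else if (st.ttype == StreamTokenizer.TT_NUMBER) {
               tokens.add("NUMBER " + st.nval);
            } else {
               tokens.add("CHAR " + (char) st.ttype);
            }
         }
      } catch (IOException e) {
         throw new UncheckedIOException(e);
      }
      return tokens;
   }

   public static void main(String[] args) {
      System.out.println(tokenize("/* header */ int a = 4 / 2; // halve\nint b;"));
   }
}
```

With this setup the statement int a = 4 / 2; comes through completely, including the / and the 2.0 that the default configuration drops.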

10. lowerCaseMode(boolean lc)

If lc is true, all word tokens are lowercased when returned.
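
Lowercasing is handy for case-insensitive keyword matching. A brief sketch, with LowerCaseDemo and words as hypothetical names:

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

class LowerCaseDemo {
   // Hypothetical helper: collects all word tokens in lowercase.
   static List<String> words(String text) {
      StreamTokenizer st = new StreamTokenizer(new StringReader(text));
      st.lowerCaseMode(true);
      List<String> result = new ArrayList<>();
      try {
         while (st.nextToken() != StreamTokenizer.TT_EOF) {
            if (st.ttype == StreamTokenizer.TT_WORD) {
               result.add(st.sval);   // already lowercased by the tokenizer
            }
         }
      } catch (IOException e) {
         throw new UncheckedIOException(e);
      }
      return result;
   }

   public static void main(String[] args) {
      System.out.println(words("SELECT Name FROM Users")); // prints [select, name, from, users]
   }
}
```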

11. pushBack()

“Pushes” the last token that was returned back onto the stream. The next time nextToken is invoked, it returns the same token type again and leaves nval and sval unchanged.
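
pushBack is typically used for one-token lookahead. The sketch below reads numbers that may or may not be followed by a unit word; PushBackDemo, readMeasurements and the input format are all assumptions made for the illustration. Note that the number has to be copied out of nval before looking ahead, because the lookahead may overwrite it.

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

class PushBackDemo {
   // Hypothetical helper: pairs each number with the unit word that follows it, if any.
   static List<String> readMeasurements(String text) {
      StreamTokenizer st = new StreamTokenizer(new StringReader(text));
      List<String> result = new ArrayList<>();
      try {
         while (st.nextToken() != StreamTokenizer.TT_EOF) {
            if (st.ttype != StreamTokenizer.TT_NUMBER) {
               continue;                 // ignore anything that is not a number
            }
            double value = st.nval;      // save before the lookahead changes nval
            if (st.nextToken() == StreamTokenizer.TT_WORD) {
               result.add(value + " " + st.sval);
            } else {
               st.pushBack();            // not a unit: let the main loop see it again
               result.add(String.valueOf(value));
            }
         }
      } catch (IOException e) {
         throw new UncheckedIOException(e);
      }
      return result;
   }

   public static void main(String[] args) {
      System.out.println(readMeasurements("10 km 5 20 m")); // prints [10.0 km, 5.0, 20.0 m]
   }
}
```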

Then there’s also the method lineno() that you may call at any time to get the current line number.
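
A typical use of lineno() is to attach a position to each token, for example in error messages. A minimal sketch, with LineNoDemo and wordsWithLines as hypothetical names:

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

class LineNoDemo {
   // Hypothetical helper: prefixes every word with the line it was found on.
   static List<String> wordsWithLines(String text) {
      StreamTokenizer st = new StreamTokenizer(new StringReader(text));
      List<String> result = new ArrayList<>();
      try {
         while (st.nextToken() != StreamTokenizer.TT_EOF) {
            if (st.ttype == StreamTokenizer.TT_WORD) {
               // line numbers start at 1
               result.add(st.lineno() + ": " + st.sval);
            }
         }
      } catch (IOException e) {
         throw new UncheckedIOException(e);
      }
      return result;
   }

   public static void main(String[] args) {
      for (String entry : wordsWithLines("alpha\nbeta gamma\n\ndelta")) {
         System.out.println(entry);
      }
   }
}
```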