Saturday, May 08, 2004

Compiler Vs Lex/Yacc...



What is a compiler? How is it related to Lex/Yacc?

Lex/Yacc plays a very important role in Compiler construction.

A compiler is basically a tool that translates source code (high-level code) into machine code (low-level code). Source code is fed to the compiler, which scans it and converts it into machine code.

Source code is written in some programming language. Programming languages are used to speak with computers in much the same way as we use natural language to interact with each other. A compiler consists of the following components:

Lexer
Parser
Semantic analyser
Source code optimizer
Code generator
Code optimizer

Thus, source code is taken through the above six components to be converted into machine code.

Lexical analysis and syntax analysis are the phases handled by Lex and Yacc respectively. The following shows how Lex/Yacc treat an expression:

y = i*5

The above is scanned and converted into the tokens y, =, i, *, 5.
The parser then does the syntax analysis of the above in the following manner:

EXPRESSION: EXPRESSION ASSIGN_EXPRESSION MULTIPLICATIVE_EXPRESSION
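The scanning step can be sketched in plain Java. This is a hand-written toy, not something generated by Lex/Yacc; the class and method names (`TinyLexer`, `tokenize`) are illustrative, not from any real tool:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TinyLexer {
    // Three token shapes are enough for this expression:
    // identifiers, integer literals, and single-char operators.
    private static final Pattern TOKEN =
        Pattern.compile("[A-Za-z_][A-Za-z0-9_]*|[0-9]+|[=*+/-]");

    // Split an expression like "y = i*5" into its tokens.
    public static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("y = i*5"));
        // [y, =, i, *, 5]
    }
}
```

A real Lex specification would express the same three patterns as rules, with the parser (Yacc) consuming the resulting token stream.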

Watch out for the next article, coming very soon!

Friday, May 07, 2004

JFlex/BYaccJ and Encoding Issues



Have you ever run into the problem of scanning a file or data retrieved from a platform with Cp1252 encoding, but opened using a Java Reader with the default encoding (which could be different from Cp1252)? Such scenarios can sometimes lead to failures in scanning.

How far is it right to open a Java stream with the default platform encoding?

What if data with UTF-16BE/UTF-16LE encoding is read by a Java IO Reader using a platform default encoding of UTF-8? In that case, would the BOM character be read properly, or would it get corrupted? The bytes 0xFE and 0xFF are never valid in UTF-8, so the Java Reader does not recognize them; it converts each of them into the replacement character U+FFFD, which could introduce a bug into the scanning process. So what should be the method to read the data appropriately?
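The corruption is easy to demonstrate. A minimal sketch (the class name is illustrative; the two bytes are a UTF-16BE BOM):

```java
import java.nio.charset.StandardCharsets;

public class BomCorruption {
    public static void main(String[] args) {
        // The two bytes of a UTF-16BE BOM (U+FEFF encoded big-endian).
        byte[] bom = { (byte) 0xFE, (byte) 0xFF };

        // Decoding them as UTF-8 fails: 0xFE and 0xFF can never appear
        // in well-formed UTF-8, so the default decoder replaces each
        // byte with U+FFFD.
        String wrong = new String(bom, StandardCharsets.UTF_8);
        for (char c : wrong.toCharArray()) {
            System.out.printf("U+%04X%n", (int) c); // U+FFFD, twice
        }

        // Decoding with the matching charset recovers the BOM intact.
        String right = new String(bom, StandardCharsets.UTF_16BE);
        System.out.printf("U+%04X%n", (int) right.charAt(0)); // U+FEFF
    }
}
```

This is why a lexer fed through a Reader opened with the wrong charset can choke on the very first character of the input.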

That is pretty interesting with Java Lexers/Scanners!

One could write a Java class to identify the data's encoding. UTF-16LE/UTF-16BE is identified by the presence of a BOM (Byte Order Mark) at the beginning of the data stream. UTF-8 can also be identified by a BOM, but the BOM is optional in UTF-8.

What is BOM?

The Byte Order Mark (BOM) is the Unicode character U+FEFF. (It can also represent a Zero Width No-break Space.) The code point U+FFFE is illegal in Unicode and should never appear in a Unicode character stream. Therefore the BOM can be used as the first character of a file (or more generally a string) as an indicator of endianness. With UTF-16, if the first 16-bit word is read as U+FEFF, the text has the same endianness as the machine reading it. If it is read as U+FFFE, the endianness is reversed and all 16-bit words should be byte-swapped as they are read in. In the same way, the BOM indicates the endianness of text encoded with UTF-32.
A byte-order mark is not a control character that selects the byte order of the text; it simply informs an application receiving the file of the byte order the file already has.

Following is a list of BOM byte sequences for the various encodings:

UTF-8 EF BB BF
UTF-16BE FE FF
UTF-16LE FF FE
UTF-32BE 00 00 FE FF
UTF-32LE FF FE 00 00


One could get the data in the form of bytes. The bytes can then easily be examined for the presence of a BOM.
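A minimal sketch of such a check (the class and method names are illustrative; note that the longer BOMs must be tested before the shorter ones, since UTF-32LE begins with the same two bytes, FF FE, as UTF-16LE):

```java
public class BomSniffer {
    // Returns a charset name guessed from the BOM, or null if no BOM.
    public static String detect(byte[] data) {
        if (startsWith(data, 0xEF, 0xBB, 0xBF)) return "UTF-8";
        if (startsWith(data, 0xFF, 0xFE, 0x00, 0x00)) return "UTF-32LE";
        if (startsWith(data, 0x00, 0x00, 0xFE, 0xFF)) return "UTF-32BE";
        if (startsWith(data, 0xFF, 0xFE)) return "UTF-16LE";
        if (startsWith(data, 0xFE, 0xFF)) return "UTF-16BE";
        return null;
    }

    // Compare the leading bytes against an expected prefix,
    // masking with 0xFF because Java bytes are signed.
    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] utf16le = { (byte) 0xFF, (byte) 0xFE, (byte) 0x68, 0x00 };
        System.out.println(detect(utf16le)); // UTF-16LE
    }
}
```

Once the encoding is detected, the appropriate charset can be passed explicitly to an InputStreamReader instead of relying on the platform default.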

Do read about UTF and BOM in this link.


In the meantime, enjoy reading a very nice tutorial for beginners - Introduction to Compilers

Let me know if you have any specific questions on encoding issues vis-a-vis lexers!

Wednesday, May 05, 2004

Some interesting applications relating to Lex/Yacc



Following are some applications that give an idea of how to use lexer/parser generators:

Converting Legacy Data to XML using a Lexer/Parser Generator: this gives a fair idea of JFlex/CUP and their use in building a lexer/parser.

Another application is Using a Lexer/Parser Generator as a Multipurpose XML Tool Builder

Tuesday, May 04, 2004

Object oriented lexing and parsing



How about writing an object-oriented lexer and parser generator? How useful is it? What are the available tools? Let's look at these one by one!

How about implementing lexers/parsers in C/C++? Check out a useful tutorial at the following link: Lexer/Parser in C++

The above link gives you first-hand insight into how a C compiler's lex and yacc phases process the following program:

#include <stdio.h>

int main()
{
    printf("Hello World");
    return 0;
}
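As a rough picture of what the lexing phase produces for that program, here is a hand-rolled Java sketch. It is not what lex itself would emit; the token class names (KEYWORD, IDENT, and so on) and the class name `CTokenDump` are made up for illustration, and real C lexers handle far more (preprocessor directives, comments, escape sequences, ...):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CTokenDump {
    // Very rough token classes, just enough for the snippet above.
    // Keywords are anchored with \b so "int" won't match inside an
    // identifier like "integer".
    private static final Pattern TOKEN = Pattern.compile(
        "(?<KEYWORD>\\b(?:int|return)\\b)"
      + "|(?<IDENT>[A-Za-z_][A-Za-z0-9_]*)"
      + "|(?<NUMBER>[0-9]+)"
      + "|(?<STRING>\"[^\"]*\")"
      + "|(?<PUNCT>[(){};])");

    // Return each token as "KIND text".
    public static List<String> lex(String src) {
        List<String> out = new ArrayList<>();
        Matcher m = TOKEN.matcher(src);
        while (m.find()) {
            String kind = m.group("KEYWORD") != null ? "KEYWORD"
                        : m.group("IDENT")   != null ? "IDENT"
                        : m.group("NUMBER")  != null ? "NUMBER"
                        : m.group("STRING")  != null ? "STRING"
                        : "PUNCT";
            out.add(kind + " " + m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        String src = "int main() { printf(\"Hello World\"); return 0; }";
        for (String t : lex(src)) System.out.println(t);
    }
}
```

The parser's job then starts from this stream: KEYWORD int, IDENT main, and so on, matching it against the grammar's productions.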

Check out the link Making a Parser in C++