JFlex/BYaccJ and Encoding Issues
Have you ever run into a problem scanning a file or data produced on a platform using Cp1252, but opened with a Java Reader using the default platform encoding (which may well not be Cp1252)? Scenarios like this can make the scanner fail.
How safe is it, really, to open a Java stream with the platform's default encoding?
What if data encoded as UTF-16BE or UTF-16LE is read by a Java Reader whose platform default encoding is UTF-8? Would the BOM be read properly, or would it get corrupted? The BOM bytes FE FF (or FF FE) are not valid UTF-8, so Java's decoder silently substitutes each of them with the replacement character U+FFFD, which can introduce bugs into the scanning process. So what is the right way to read such data?
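The corruption is easy to demonstrate. The following sketch (class name is my own, for illustration) encodes the single character "A" as UTF-16BE with a BOM, then decodes the bytes with a UTF-8 Reader: the two BOM bytes each come back as U+FFFD, and the high byte of "A" comes back as U+0000.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

public class BomCorruptionDemo {
    public static void main(String[] args) throws Exception {
        // "A" encoded as UTF-16BE: BOM (FE FF) followed by 00 41
        byte[] utf16be = { (byte) 0xFE, (byte) 0xFF, 0x00, 0x41 };

        // Read it back with the WRONG charset (UTF-8).
        // FE and FF are never valid in UTF-8, so the decoder
        // replaces each with U+FFFD rather than throwing.
        InputStreamReader reader = new InputStreamReader(
                new ByteArrayInputStream(utf16be), StandardCharsets.UTF_8);
        int c;
        while ((c = reader.read()) != -1) {
            System.out.printf("U+%04X%n", c);
        }
        // Prints: U+FFFD, U+FFFD, U+0000, U+0041 -- the BOM is gone
        // and a stray NUL has appeared in front of the 'A'.
    }
}
```

The fix is to detect the encoding first (see the BOM discussion below) and pass the correct Charset to the InputStreamReader constructor instead of relying on the default.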
That is pretty interesting with Java Lexers/Scanners!
One could write a Java class to identify the data's encoding. UTF-16LE/UTF-16BE can be identified by the presence of a BOM (Byte Order Mark) at the beginning of the data stream. UTF-8 can also carry a BOM, but in UTF-8 the BOM is optional.
What is BOM?
The Byte Order Mark (BOM) is the Unicode character U+FEFF. (Historically it also served as a zero-width no-break space.) The byte-swapped code point U+FFFE is a noncharacter and should never appear in a Unicode text stream. Therefore the BOM can be used as the first character of a file (or, more generally, a stream) as an indicator of endianness. With UTF-16, if the first two bytes read are FE FF, the text is big-endian; if they are FF FE, the text is little-endian and each 16-bit unit must be byte-swapped as it is read in. In the same way, the BOM indicates the endianness of text encoded with UTF-32.
A byte-order mark is not a control character that selects the byte order of the text; it simply informs an application receiving the data of the byte order in which the text was written.
Following is a list of BOM byte sequences for the various encodings:
UTF-8 EF BB BF
UTF-16BE FE FF
UTF-16LE FF FE
UTF-32BE 00 00 FE FF
UTF-32LE FF FE 00 00
One could read the data as raw bytes first; the bytes can then easily be examined for the presence of a BOM.
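As a minimal sketch of that idea (class and method names are my own), the code below peeks at the first few bytes of a stream, matches them against the BOM table above, and pushes any non-BOM bytes back so the caller can hand the stream to a Reader with the detected charset:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.PushbackInputStream;

public class BomSniffer {
    // Returns the charset name indicated by the BOM, or null if no BOM
    // is present. Non-BOM bytes are pushed back onto the stream.
    public static String detectEncoding(PushbackInputStream in) throws IOException {
        byte[] bom = new byte[4];
        int n = in.read(bom, 0, 4);
        // Check the longest signatures first, so UTF-32LE (FF FE 00 00)
        // is not mistaken for UTF-16LE (FF FE).
        if (n >= 4 && bom[0] == 0x00 && bom[1] == 0x00
                && bom[2] == (byte) 0xFE && bom[3] == (byte) 0xFF) {
            return "UTF-32BE";
        }
        if (n >= 4 && bom[0] == (byte) 0xFF && bom[1] == (byte) 0xFE
                && bom[2] == 0x00 && bom[3] == 0x00) {
            return "UTF-32LE";
        }
        if (n >= 3 && bom[0] == (byte) 0xEF && bom[1] == (byte) 0xBB
                && bom[2] == (byte) 0xBF) {
            if (n > 3) in.unread(bom, 3, n - 3);   // put back non-BOM bytes
            return "UTF-8";
        }
        if (n >= 2 && bom[0] == (byte) 0xFE && bom[1] == (byte) 0xFF) {
            if (n > 2) in.unread(bom, 2, n - 2);
            return "UTF-16BE";
        }
        if (n >= 2 && bom[0] == (byte) 0xFF && bom[1] == (byte) 0xFE) {
            if (n > 2) in.unread(bom, 2, n - 2);
            return "UTF-16LE";
        }
        // No BOM at all: push everything back and let the caller
        // fall back to a sensible default (not the platform default!).
        if (n > 0) in.unread(bom, 0, n);
        return null;
    }

    public static void main(String[] args) throws IOException {
        byte[] data = { (byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'h', 'i' };
        // The pushback buffer must be at least 4 bytes for this to work.
        PushbackInputStream in =
                new PushbackInputStream(new ByteArrayInputStream(data), 4);
        System.out.println(detectEncoding(in));  // prints UTF-8
        System.out.println((char) in.read());    // prints h -- BOM consumed
    }
}
```

Note that a stream without a BOM tells you nothing; in that case you still have to fall back to a configured or guessed encoding rather than the platform default.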
Do read about UTF and BOM in this link.
In the meantime, enjoy reading a very nice tutorial for beginners -
Introduction to Compilers
Let me know if you have any specific questions on encoding issues vis-a-vis lexers!