
Java Character Encoding and Decoding: The Implementation in Detail



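Character set basics:

Character set
         A collection of characters, that is, symbols with specific semantic meaning. The letter "A" is a character. "%" is also a character. Characters have no intrinsic numeric value and no direct connection to ASCII, Unicode, or even computers. Symbols existed long before computers did.
Coded character set
         An assignment of numeric values to a set of characters. Assigning codes to characters lets them be represented as numbers under a particular coded character set. Other coded character sets may assign a different number to the same character. Coded character sets are usually defined by standards bodies; examples are US-ASCII, ISO 8859-1, Unicode (ISO 10646-1), and JIS X0201.
Character-encoding scheme
         A mapping of the members of a coded character set to octets (8-bit bytes). The encoding scheme defines how a sequence of character codes is expressed as a sequence of bytes. The numeric value of a character code need not equal the encoded bytes, and the relationship need not be one-to-one or one-to-many. In principle, encoding and decoding character data is much like serializing and deserializing an object.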


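Character data is usually encoded for network transmission or file storage. An encoding scheme is not a character set, it is a mapping; but because the two are so closely related, most encodings are associated with a single character set. UTF-8, for example, is used only to encode the Unicode character set. Even so, one encoding scheme can handle more than one character set. EUC, for example, can encode characters for several Asian languages.
Figure 6-1 (not reproduced here) shows a Unicode character sequence being encoded as a byte sequence with the UTF-8 encoding scheme. UTF-8 encodes character code values below 0x80 as a single byte (standard ASCII). All other Unicode characters are encoded as multibyte sequences of 2 to 6 bytes under RFC 2279 (http://www.ietf.org/rfc/rfc2279.txt); the later RFC 3629 restricts UTF-8 to at most 4 bytes per character.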
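The variable-length behavior of UTF-8 is easy to observe. The following is a minimal sketch rather than part of the original example set; the class name `Utf8Lengths` and the helper `utf8Length` are invented for illustration, and it assumes Java 7 or later for `StandardCharsets`.

```java
import java.nio.charset.StandardCharsets;

public class Utf8Lengths {
    /** Returns how many bytes UTF-8 needs for a single Unicode code point. */
    static int utf8Length(int codePoint) {
        String s = new String(Character.toChars(codePoint));
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        // 'A', n-with-tilde, the euro sign, and a supplementary-plane character
        int[] samples = { 0x41, 0xF1, 0x20AC, 0x1F600 };
        for (int cp : samples) {
            System.out.printf("U+%04X -> %d byte(s)%n", cp, utf8Length(cp));
        }
    }
}
```

Code points below 0x80 take one byte, and the JDK encoder never emits more than four bytes for a single code point.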

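Charset
       The term charset is defined in RFC 2278 (http://ietf.org/rfc/rfc2278.txt). It is the combination of a coded character set and a character-encoding scheme. The central class of the java.nio.charset package is Charset, which encapsulates the charset abstraction.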
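Unicode was originally designed as a 16-bit character encoding; a Java char is still a 16-bit UTF-16 code unit, and characters beyond U+FFFF are represented as surrogate pairs. Unicode attempts to unify the character sets of all the world's languages into a single, comprehensive mapping. It has earned its place, but many other character encodings remain in wide use today.
Most operating systems are still byte-oriented when it comes to I/O and file storage, so whatever encoding is used, Unicode or otherwise, byte sequences still have to be converted to and from characters.
The classes of the java.nio.charset package meet this need. This is not the first time the Java platform has dealt with character set encoding, but it is the most systematic, comprehensive, and flexible solution. The java.nio.charset.spi package provides a Service Provider Interface (SPI) so that encoders and decoders can be plugged in as needed.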


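Charsets: The default charset is determined at JVM startup and depends on the underlying operating system environment, the locale, and/or the JVM configuration. If you need a specific charset, the safest approach is to name it explicitly. Do not assume that the default in your deployment environment matches the one in your development environment. Charset names are case-insensitive: uppercase and lowercase letters are treated as equal when names are compared. The Internet Assigned Numbers Authority (IANA) maintains all officially registered charset names.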
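A short sketch of naming the charset explicitly instead of relying on the default. The class `ExplicitCharset` and the helper `toUtf8` are illustrative names only, and `StandardCharsets` assumes Java 7 or later.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ExplicitCharset {
    /** Encodes text with an explicitly named charset, not the platform default. */
    static byte[] toUtf8(String text) {
        return text.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Whatever the JVM picked at startup; this can differ between machines
        System.out.println("Default charset: " + Charset.defaultCharset());
        // Names are case-insensitive, so both lookups yield the same charset
        System.out.println(Charset.forName("utf-8").equals(Charset.forName("UTF-8")));
        System.out.println(toUtf8("Ma\u00f1ana").length + " bytes");
    }
}
```

Calling `String.getBytes()` with no argument would use the default charset, which is exactly the dependency the paragraph above warns about.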


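Example 6-1 demonstrates how characters are translated to byte sequences by different Charset implementations.

Example 6-1. Encoding with the standard charsets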

    package com.ronsoft.books.nio.charset; 

    import java.nio.charset.Charset; 
    import java.nio.ByteBuffer; 

    /**
     * Charset encoding test. Run the same input string, which contains some
     * non-ascii characters, through several Charset encoders and dump out the hex
     * values of the resulting byte sequences.
     * 
     * @author Ron Hitchens ([email protected])
     */ 
    public class EncodeTest { 
        public static void main(String[] argv) throws Exception { 
            // This is the character sequence to encode 
            String input = " \u00bfMa\u00f1ana?"; 
            // the list of charsets to encode with 
            String[] charsetNames = { "US-ASCII", "ISO-8859-1", "UTF-8", 
                    "UTF-16BE", "UTF-16LE", "UTF-16" // , "X-ROT13" 
            }; 
            for (int i = 0; i < charsetNames.length; i++) { 
                doEncode(Charset.forName(charsetNames[i]), input); 
            } 
        } 

        /**
         * For a given Charset and input string, encode the chars and print out the
         * resulting byte encoding in a readable form.
         */ 
        private static void doEncode(Charset cs, String input) { 
            ByteBuffer bb = cs.encode(input); 
            System.out.println("Charset: " + cs.name()); 
            System.out.println("  Input: " + input); 
            System.out.println("Encoded: "); 
            for (int i = 0; bb.hasRemaining(); i++) { 
                int b = bb.get(); 
                int ival = ((int) b) & 0xff; 
                char c = (char) ival; 
                // Keep tabular alignment pretty 
                if (i < 10) 
                    System.out.print(" "); 
                // Print index number 
                System.out.print("  " + i + ": "); 
                // Better formatted output is coming someday... 
                if (ival < 16) 
                    System.out.print("0"); 
                // Print the hex value of the byte 
                System.out.print(Integer.toHexString(ival)); 
                // If the byte seems to be the value of a 
                // printable character, print it. No guarantee 
                // it will be. 
                if (Character.isWhitespace(c) || Character.isISOControl(c)) { 
                    System.out.println(""); 
                } else { 
                    System.out.println(" (" + c + ")"); 
                } 
            } 
            System.out.println(""); 
        } 
    } 

Output (characters that the console could not display appear as ?):

 Charset: US-ASCII 
  Input:  ?Ma?ana? 
Encoded:  
   0: 20 
   1: 3f (?) 
   2: 4d (M) 
   3: 61 (a) 
   4: 3f (?) 
   5: 61 (a) 
   6: 6e (n) 
   7: 61 (a) 
   8: 3f (?) 

Charset: ISO-8859-1 
  Input:  ?Ma?ana? 
Encoded:  
   0: 20 
   1: bf (?) 
   2: 4d (M) 
   3: 61 (a) 
   4: f1 (?) 
   5: 61 (a) 
   6: 6e (n) 
   7: 61 (a) 
   8: 3f (?) 

Charset: UTF-8 
  Input:  ?Ma?ana? 
Encoded:  
   0: 20 
   1: c2 (?) 
   2: bf (?) 
   3: 4d (M) 
   4: 61 (a) 
   5: c3 (?) 
   6: b1 (±) 
   7: 61 (a) 
   8: 6e (n) 
   9: 61 (a) 
  10: 3f (?) 

Charset: UTF-16BE 
  Input:  ?Ma?ana? 
Encoded:  
   0: 00 
   1: 20 
   2: 00 
   3: bf (?) 
   4: 00 
   5: 4d (M) 
   6: 00 
   7: 61 (a) 
   8: 00 
   9: f1 (?) 
  10: 00 
  11: 61 (a) 
  12: 00 
  13: 6e (n) 
  14: 00 
  15: 61 (a) 
  16: 00 
  17: 3f (?) 

Charset: UTF-16LE 
  Input:  ?Ma?ana? 
Encoded:  
   0: 20 
   1: 00 
   2: bf (?) 
   3: 00 
   4: 4d (M) 
   5: 00 
   6: 61 (a) 
   7: 00 
   8: f1 (?) 
   9: 00 
  10: 61 (a) 
  11: 00 
  12: 6e (n) 
  13: 00 
  14: 61 (a) 
  15: 00 
  16: 3f (?) 
  17: 00 

Charset: UTF-16 
  Input:  ?Ma?ana? 
Encoded:  
   0: fe (?) 
   1: ff (?) 
   2: 00 
   3: 20 
   4: 00 
   5: bf (?) 
   6: 00 
   7: 4d (M) 
   8: 00 
   9: 61 (a) 
  10: 00 
  11: f1 (?) 
  12: 00 
  13: 61 (a) 
  14: 00 
  15: 6e (n) 
  16: 00 
  17: 61 (a) 
  18: 00 
  19: 3f (?)

The Charset class:

    package java.nio.charset;  
    public abstract class Charset implements Comparable  
    {  
            public static boolean isSupported (String charsetName)  
            public static Charset forName (String charsetName)  
            public static SortedMap availableCharsets()   
            public final String name()   
            public final Set aliases()  
            public String displayName()  
            public String displayName (Locale locale)   
            public final boolean isRegistered()   
            public boolean canEncode()   
            public abstract CharsetEncoder newEncoder();   
            public final ByteBuffer encode (CharBuffer cb)   
            public final ByteBuffer encode (String str)   
            public abstract CharsetDecoder newDecoder();   
            public final CharBuffer decode (ByteBuffer bb)   
            public abstract boolean contains (Charset cs);  
            public final boolean equals (Object ob)  
            public final int compareTo (Object ob)   
            public final int hashCode()  
            public final String toString()   
    } 

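A Charset object must satisfy several conditions:

  The canonical name of the charset should match the name registered with IANA.
  If IANA has registered more than one name for the same charset, the canonical name returned by the object should match the MIME-preferred name in the IANA registry.
  If a charset name is removed from the registry, the former canonical name should be retained as an alias.
  If the charset is not registered with IANA, its canonical name must begin with "X-" or "x-".

In most cases only JVM vendors need to worry about these rules. However, if you plan to ship your own charset as part of an application, knowing what not to do will help. You should return false from isRegistered() and give your charset a name that begins with "X-".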
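The naming rules can be inspected on any installed charset. This is a small sketch; `CharsetNames` and `canonicalName` are hypothetical names, and the exact alias set printed depends on the JDK in use.

```java
import java.nio.charset.Charset;

public class CharsetNames {
    /** Resolves any accepted name or alias to the charset's canonical name. */
    static String canonicalName(String nameOrAlias) {
        return Charset.forName(nameOrAlias).name();
    }

    public static void main(String[] args) {
        Charset latin1 = Charset.forName("latin1");
        System.out.println("Canonical name: " + latin1.name());
        System.out.println("Aliases:        " + latin1.aliases());
        System.out.println("IANA registered: " + latin1.isRegistered());
    }
}
```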


Comparing charsets:

    public abstract class Charset implements Comparable  
    {  
            // This is a partial API listing  
            public abstract boolean contains (Charset cs);   
            public final boolean equals (Object ob)  
            public final int compareTo (Object ob)   
            public final int hashCode()  
            public final String toString()   
    } 

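Recall that a charset is the combination of a coded character set and an encoding scheme for that set. As with ordinary sets, one charset may be a subset of another. One charset (C1) contains another (C2) if every character that can be represented in C2 can be represented identically in C1. Every charset is considered to contain itself. If the containment relationship holds, any stream you can encode in C2 (the contained subset) is guaranteed to be encodable in C1 as well, without any substitutions.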
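A brief sketch of the containment test, using the JDK's built-in charsets. `ContainsDemo` and `covers` are illustrative names. Note that contains() is conservative: a true result is a guarantee, while false only means the implementation does not claim containment.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ContainsDemo {
    /** True if the JDK reports that 'outer' can represent everything in 'inner'. */
    static boolean covers(Charset outer, Charset inner) {
        return outer.contains(inner);
    }

    public static void main(String[] args) {
        Charset utf8 = StandardCharsets.UTF_8;
        Charset ascii = StandardCharsets.US_ASCII;
        System.out.println("UTF-8 contains US-ASCII: " + covers(utf8, ascii));
        System.out.println("US-ASCII contains UTF-8: " + covers(ascii, utf8));
        System.out.println("US-ASCII contains itself: " + covers(ascii, ascii));
    }
}
```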


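Charset encoders: A charset is made up of a coded character set and a related encoding scheme. The CharsetEncoder and CharsetDecoder classes implement the conversion scheme.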

 float averageBytesPerChar()  
          Returns the average number of bytes that will be produced for each character of input.  
 boolean canEncode(char c)  
          Tells whether or not this encoder can encode the given character.  
 boolean canEncode(CharSequence cs)  
          Tells whether or not this encoder can encode the given character sequence.  
 Charset charset()  
          Returns the charset that created this encoder.  
 ByteBuffer encode(CharBuffer in)  
          Convenience method that encodes the remaining content of a single input character buffer into a newly-allocated byte buffer.  
 CoderResult encode(CharBuffer in, ByteBuffer out, boolean endOfInput)  
          Encodes as many characters as possible from the given input buffer, writing the results to the given output buffer.  
protected abstract  CoderResult encodeLoop(CharBuffer in, ByteBuffer out)  
          Encodes one or more characters into one or more bytes.  
 CoderResult flush(ByteBuffer out)  
          Flushes this encoder.  
protected  CoderResult implFlush(ByteBuffer out)  
          Flushes this encoder.  
protected  void implOnMalformedInput(CodingErrorAction newAction)  
          Reports a change to this encoder's malformed-input action.  
protected  void implOnUnmappableCharacter(CodingErrorAction newAction)  
          Reports a change to this encoder's unmappable-character action.  
protected  void implReplaceWith(byte[] newReplacement)  
          Reports a change to this encoder's replacement value.  
protected  void implReset()  
          Resets this encoder, clearing any charset-specific internal state.  
 boolean isLegalReplacement(byte[] repl)  
          Tells whether or not the given byte array is a legal replacement value for this encoder.  
 CodingErrorAction malformedInputAction()  
          Returns this encoder's current action for malformed-input errors.  
 float maxBytesPerChar()  
          Returns the maximum number of bytes that will be produced for each character of input.  
 CharsetEncoder onMalformedInput(CodingErrorAction newAction)  
          Changes this encoder's action for malformed-input errors.  
 CharsetEncoder onUnmappableCharacter(CodingErrorAction newAction)  
          Changes this encoder's action for unmappable-character errors.  
 byte[] replacement()  
          Returns this encoder's replacement value.  
 CharsetEncoder replaceWith(byte[] newReplacement)  
          Changes this encoder's replacement value.  
 CharsetEncoder reset()  
          Resets this encoder, clearing any internal state.  
 CodingErrorAction unmappableCharacterAction()  
          Returns this encoder's current action for unmappable-character errors. 

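A CharsetEncoder object is a stateful transformation engine: characters go in and bytes come out. Several calls to the encoder may be needed to complete a conversion, and the encoder keeps the conversion state between calls.

A note about the CharsetEncoder API: the simpler form of encode() is a convenience method that encodes all the characters in the CharBuffer you supply into a newly allocated ByteBuffer in a single step. This is the method that is ultimately invoked when you call encode() directly on the Charset class.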

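The full form of encode() returns a CoderResult that reports one of the following conditions:

Underflow
         The input buffer has been consumed (or holds only an incomplete sequence), so the encoder needs more input before it can continue.

Overflow
         The output buffer has no room for more bytes; drain it and call encode() again.

Malformed input
         The input is not a legal sequence of 16-bit Unicode characters, for example an unpaired surrogate.

Unmappable character
         The input is valid, but the encoder's charset has no byte sequence for it.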
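The underflow and overflow conditions are easiest to see with a deliberately small output buffer. The following is a sketch under stated assumptions, not a definitive implementation: `EncodeLoopSketch`, `encodeAll`, and `drain` are invented names, error actions are set to REPLACE so that only underflow and overflow can be returned, and the whole input is available up front.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CoderResult;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class EncodeLoopSketch {
    static byte[] encodeAll(String text, Charset cs, int outCapacity) {
        CharsetEncoder encoder = cs.newEncoder()
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE);
        // The output buffer must hold at least one fully encoded character,
        // otherwise the loop below could never make progress
        if (outCapacity < (int) Math.ceil(encoder.maxBytesPerChar()) * 2) {
            throw new IllegalArgumentException("output buffer too small");
        }
        CharBuffer in = CharBuffer.wrap(text);
        ByteBuffer out = ByteBuffer.allocate(outCapacity);
        ByteArrayOutputStream sink = new ByteArrayOutputStream();

        CoderResult result;
        do {
            // All input is already in the buffer, so endOfInput is true
            result = encoder.encode(in, out, true);
            drain(out, sink);
        } while (result.isOverflow()); // overflow: output was full, try again

        // Underflow reached: flush any bytes the encoder is still holding
        while (encoder.flush(out).isOverflow()) {
            drain(out, sink);
        }
        drain(out, sink);
        return sink.toByteArray();
    }

    /** Moves the bytes written so far into the sink and empties the buffer. */
    static void drain(ByteBuffer out, ByteArrayOutputStream sink) {
        out.flip();
        while (out.hasRemaining()) {
            sink.write(out.get());
        }
        out.clear();
    }

    public static void main(String[] args) {
        byte[] bytes = encodeAll(" \u00bfMa\u00f1ana?", StandardCharsets.UTF_8, 8);
        System.out.println(bytes.length + " bytes");
    }
}
```

With an 8-byte output buffer, the input used in Example 6-1 overflows once before the encoder reports underflow.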


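When encoding, if the encoder encounters malformed or unmappable input, it returns a result object describing the error. You can also test individual characters, or sequences of characters, to determine whether they can be encoded. These are the methods that test encodability: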

    package java.nio.charset;  
    public abstract class CharsetEncoder   
    {  
             // This is a partial API listing   
            public boolean canEncode (char c)   
            public boolean canEncode (CharSequence cs)  
    } 
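A quick sketch of the canEncode() test; `CanEncodeDemo` and `encodable` are illustrative names. A fresh encoder is created for each check because canEncode() may not be called while an encoding operation is in progress.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CanEncodeDemo {
    /** True if every character of 'text' can be encoded in the given charset. */
    static boolean encodable(String text, Charset cs) {
        return cs.newEncoder().canEncode(text);
    }

    public static void main(String[] args) {
        String word = "Ma\u00f1ana";
        System.out.println("US-ASCII:   " + encodable(word, StandardCharsets.US_ASCII));
        System.out.println("ISO-8859-1: " + encodable(word, StandardCharsets.ISO_8859_1));
    }
}
```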

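CodingErrorAction defines three public fields:

REPORT
       The default behavior when a CharsetEncoder is created. It indicates that coding errors should be reported by returning a CoderResult object, as described earlier.

IGNORE
         Indicates that coding errors should be ignored: the erroneous input is dropped and coding continues as if it had not been there.

REPLACE
         Handles coding errors by dropping the erroneous input and writing out the current replacement byte sequence defined for this CharsetEncoder.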
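The three actions can be compared side by side by encoding the same string to US-ASCII. This is a minimal sketch; `ErrorActionDemo` and `encodeAscii` are hypothetical names. With the convenience form of encode(), a REPORT action surfaces as a CharacterCodingException instead of a returned CoderResult.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class ErrorActionDemo {
    /** Encodes to US-ASCII with the given error action, then decodes for display. */
    static String encodeAscii(String text, CodingErrorAction action)
            throws CharacterCodingException {
        CharsetEncoder encoder = StandardCharsets.US_ASCII.newEncoder()
                .onMalformedInput(action)
                .onUnmappableCharacter(action);
        ByteBuffer bytes = encoder.encode(CharBuffer.wrap(text));
        return StandardCharsets.US_ASCII.decode(bytes).toString();
    }

    public static void main(String[] args) throws CharacterCodingException {
        String input = "Ma\u00f1ana";
        System.out.println("REPLACE: " + encodeAscii(input, CodingErrorAction.REPLACE));
        System.out.println("IGNORE:  " + encodeAscii(input, CodingErrorAction.IGNORE));
        try {
            encodeAscii(input, CodingErrorAction.REPORT);
        } catch (CharacterCodingException e) {
            System.out.println("REPORT:  " + e.getClass().getSimpleName());
        }
    }
}
```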

 

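Remember that charset encoding converts characters into byte sequences in preparation for later decoding. If the replacement sequence cannot be decoded into a valid character sequence, the encoded byte sequence becomes invalid.

The CoderResult class: CoderResult objects are returned by CharsetEncoder and CharsetDecoder objects: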

    package java.nio.charset;  
    public class CoderResult {  
            public static final CoderResult OVERFLOW  
            public static final CoderResult UNDERFLOW   
            public boolean isUnderflow()   
            public boolean isOverflow()  
            public boolean isError()  
            public boolean isMalformed()   
            public boolean isUnmappable()  
            public int length()   
            public static CoderResult malformedForLength (int length)    
            public static CoderResult unmappableForLength (int length)   
            public void throwException() throws CharacterCodingException  
    }  
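A sketch of inspecting a CoderResult after a failed decode. `CoderResultDemo` and `decodeUtf8` are illustrative names; the byte 0xFF is used because it can never appear in well-formed UTF-8.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class CoderResultDemo {
    /** Decodes UTF-8 bytes with errors reported through the returned CoderResult. */
    static CoderResult decodeUtf8(byte[] bytes, CharBuffer out) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        return decoder.decode(ByteBuffer.wrap(bytes), out, true);
    }

    public static void main(String[] args) {
        CharBuffer out = CharBuffer.allocate(16);
        CoderResult result = decodeUtf8(new byte[] { 0x41, (byte) 0xFF, 0x42 }, out);
        if (result.isError()) {
            System.out.println("malformed=" + result.isMalformed()
                    + ", bad bytes=" + result.length());
        }
        try {
            // Converts the error result into the matching exception
            result.throwException();
        } catch (CharacterCodingException e) {
            System.out.println("Thrown: " + e.getClass().getSimpleName());
        }
    }
}
```

Decoding stops at the bad byte, so the output buffer holds only the characters decoded before it.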

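Charset decoders: A charset decoder is the inverse of an encoder. It converts bytes encoded with a particular encoding scheme into a sequence of 16-bit Unicode characters. Like CharsetEncoder, CharsetDecoder is a stateful transformation engine. Neither is thread-safe, because calling their methods changes their state, and that state is retained between calls.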

float averageCharsPerByte()  
          Returns the average number of characters that will be produced for each byte of input.  
 Charset charset()  
          Returns the charset that created this decoder.  
 CharBuffer decode(ByteBuffer in)  
          Convenience method that decodes the remaining content of a single input byte buffer into a newly-allocated character buffer.  
 CoderResult decode(ByteBuffer in, CharBuffer out, boolean endOfInput)  
          Decodes as many bytes as possible from the given input buffer, writing the results to the given output buffer.  
protected abstract  CoderResult decodeLoop(ByteBuffer in, CharBuffer out)  
          Decodes one or more bytes into one or more characters.  
 Charset detectedCharset()  
          Retrieves the charset that was detected by this decoder  (optional operation).  
 CoderResult flush(CharBuffer out)  
          Flushes this decoder.  
protected  CoderResult implFlush(CharBuffer out)  
          Flushes this decoder.  
protected  void implOnMalformedInput(CodingErrorAction newAction)  
          Reports a change to this decoder's malformed-input action.  
protected  void implOnUnmappableCharacter(CodingErrorAction newAction)  
          Reports a change to this decoder's unmappable-character action.  
protected  void implReplaceWith(String newReplacement)  
          Reports a change to this decoder's replacement value.  
protected  void implReset()  
          Resets this decoder, clearing any charset-specific internal state.  
 boolean isAutoDetecting()  
          Tells whether or not this decoder implements an auto-detecting charset.  
 boolean isCharsetDetected()  
          Tells whether or not this decoder has yet detected a charset  (optional operation).  
 CodingErrorAction malformedInputAction()  
          Returns this decoder's current action for malformed-input errors.  
 float maxCharsPerByte()  
          Returns the maximum number of characters that will be produced for each byte of input.  
 CharsetDecoder onMalformedInput(CodingErrorAction newAction)  
          Changes this decoder's action for malformed-input errors.  
 CharsetDecoder onUnmappableCharacter(CodingErrorAction newAction)  
          Changes this decoder's action for unmappable-character errors.  
 String replacement()  
          Returns this decoder's replacement value.  
 CharsetDecoder replaceWith(String newReplacement)  
          Changes this decoder's replacement value.  
 CharsetDecoder reset()  
          Resets this decoder, clearing any internal state.  
 CodingErrorAction unmappableCharacterAction()  
          Returns this decoder's current action for unmappable-character errors. 

The methods that actually perform decoding are:

    package java.nio.charset;  
    public abstract class CharsetDecoder  
    {  
            // This is a partial API listing  
            public final CharsetDecoder reset()   
            public final CharBuffer decode (ByteBuffer in)     
                   throws CharacterCodingException  
            public final CoderResult decode (ByteBuffer in, CharBuffer out,     
                   boolean endOfInput)  
            public final CoderResult flush (CharBuffer out)  
    }  

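Decoding is similar to encoding and involves the same basic steps:

1.   Reset the decoder by calling reset(), which puts it in a known state, ready to accept input.

2.   Call decode() zero or more times with endOfInput set to false to feed bytes to the decoding engine. As decoding proceeds, characters are added to the given CharBuffer.

3.   Call decode() once with endOfInput set to true to tell the decoder that all the input has been supplied.

4.   Call flush() to make sure that all decoded characters have been sent to the output.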
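The four steps can be sketched on a tiny input whose multibyte character is split across two chunks. This is an illustration under assumptions, not the book's code: `FourStepDecode` and `decodeInTwoChunks` are invented names, and the buffers are sized generously so that overflow handling can be left out.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.StandardCharsets;

public class FourStepDecode {
    static String decodeInTwoChunks(byte[] first, byte[] second) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder();
        ByteBuffer in = ByteBuffer.allocate(64);
        CharBuffer out = CharBuffer.allocate(64);

        // Step 1: put the decoder into a known state
        decoder.reset();

        // Step 2: feed input with endOfInput == false
        in.put(first);
        in.flip();
        decoder.decode(in, out, false);
        // Keep bytes the decoder could not use yet (a partial sequence)
        in.compact();

        // Step 3: feed the last input with endOfInput == true
        in.put(second);
        in.flip();
        decoder.decode(in, out, true);

        // Step 4: flush any characters still held by the decoder
        decoder.flush(out);

        out.flip();
        return out.toString();
    }

    public static void main(String[] args) {
        // The two bytes of n-with-tilde (c3 b1) arrive in different chunks
        byte[] first = { 0x4d, 0x61, (byte) 0xc3 };
        byte[] second = { (byte) 0xb1, 0x61 };
        System.out.println(decodeInTwoChunks(first, second));
    }
}
```

The compact() call is what preserves the dangling 0xc3 byte between the two decode() calls; clearing the buffer instead would lose it.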


Example 6-2 shows how to decode a byte stream that represents a charset encoding.

Example 6-2. Charset decoding

    package com.ronsoft.books.nio.charset; 

    import java.nio.*; 
    import java.nio.charset.*; 
    import java.nio.channels.*; 
    import java.io.*; 

    /**
     * Test charset decoding.
     * 
     * @author Ron Hitchens ([email protected])
     */ 
    public class CharsetDecode { 
        /**
         * Test charset decoding in the general case, detecting and handling buffer
         * under/overflow and flushing the decoder state at end of input. This code
         * reads from stdin and decodes the ASCII-encoded byte stream to chars. The
         * decoded chars are written to stdout. This is effectively a 'cat' for
         * input ascii files, but another charset encoding could be used by simply
         * specifying it on the command line.
         */ 
        public static void main(String[] argv) throws IOException { 
            // Default charset is ISO-8859-1 (Latin-1) 
            String charsetName = "ISO-8859-1"; 
            // Charset name can be specified on the command line 
            if (argv.length > 0) { 
                charsetName = argv[0]; 
            } 
            // Wrap a Channel around stdin, wrap a channel around stdout, 
            // find the named Charset and pass them to the decode method. 
            // If the named charset is not valid, an exception of type 
            // UnsupportedCharsetException will be thrown. 
            decodeChannel(Channels.newChannel(System.in), new OutputStreamWriter( 
                    System.out), Charset.forName(charsetName)); 
        } 

        /**
         * General purpose static method which reads bytes from a Channel, decodes
         * them according to the given Charset and writes the chars to a Writer.
         * 
         * @param source
         *            A ReadableByteChannel object which will be read to EOF as a
         *            source of encoded bytes.
         * @param writer
         *            A Writer object to which decoded chars will be written.
         * @param charset
         *            A Charset object, whose CharsetDecoder will be used to do the
         *            character set decoding.
         */ 
        public static void decodeChannel(ReadableByteChannel source, Writer writer, 
                Charset charset) throws UnsupportedCharsetException, IOException { 
            // Get a decoder instance from the Charset 
            CharsetDecoder decoder = charset.newDecoder(); 
            // Tell decoder to replace bad chars with default mark 
            decoder.onMalformedInput(CodingErrorAction.REPLACE); 
            decoder.onUnmappableCharacter(CodingErrorAction.REPLACE); 
            // Allocate radically different input and output buffer sizes 
            // for testing purposes 
            ByteBuffer bb = ByteBuffer.allocateDirect(16 * 1024); 
            CharBuffer cb = CharBuffer.allocate(57); 
            // Start with an empty input buffer in drain mode so that the 
            // first compact() below leaves it ready to be filled 
            bb.flip(); 
            // Buffer starts empty; indicate input is needed 
            CoderResult result = CoderResult.UNDERFLOW; 
            boolean eof = false; 
            while (!eof) { 
                // Input buffer underflow; decoder wants more input 
                if (result == CoderResult.UNDERFLOW) { 
                    // Keep any undecoded bytes (such as a partial multibyte 
                    // sequence) and prepare to append more input 
                    bb.compact(); 
                    // Fill the input buffer; watch for EOF 
                    eof = (source.read(bb) == -1); 
                    // Prepare the buffer for reading by decoder 
                    bb.flip(); 
                } 
                // Decode input bytes to output chars; pass EOF flag 
                result = decoder.decode(bb, cb, eof); 
                // If output buffer is full, drain output 
                if (result == CoderResult.OVERFLOW) { 
                    drainCharBuf(cb, writer); 
                } 
            } 
            // Flush any remaining state from the decoder, being careful 
            // to detect output buffer overflow(s) 
            while (decoder.flush(cb) == CoderResult.OVERFLOW) { 
                drainCharBuf(cb, writer); 
            } 
            // Drain any chars remaining in the output buffer 
            drainCharBuf(cb, writer); 
            // Close the channel; push out any buffered data to stdout 
            source.close(); 
            writer.flush(); 
        } 

        /**
         * Helper method to drain the char buffer and write its content to the given
         * Writer object. Upon return, the buffer is empty and ready to be refilled.
         * 
         * @param cb
         *            A CharBuffer containing chars to be written.
         * @param writer
         *            A Writer object to consume the chars in cb.
         */ 
        static void drainCharBuf(CharBuffer cb, Writer writer) throws IOException { 
            cb.flip(); // Prepare buffer for draining 
            // This writes the chars contained in the CharBuffer but 
            // doesn't actually modify the state of the buffer. 
            // If the char buffer was being drained by calls to get( ), 
            // a loop might be needed here. 
            if (cb.hasRemaining()) { 
                writer.write(cb.toString()); 
            } 
            cb.clear(); // Prepare buffer to be filled again 
        } 
    } 

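The charset service provider interface: The pluggable SPI architecture is used throughout the Java environment in many different contexts. The 1.4 JDK has eight packages named spi, plus several others that go by other names. Pluggability is a powerful design technique and one of the cornerstones on which Java's portability and adaptability are built.

Before looking at the API, a little explanation of how the Charset SPI works is in order. The java.nio.charset.spi package contains only one abstract class, CharsetProvider. Concrete implementations of this class supply information about the Charset objects they provide. To define a custom charset, you must first create concrete implementations of Charset, CharsetEncoder, and CharsetDecoder from the java.nio.charset package. You then create a custom subclass of CharsetProvider, which supplies those classes to the JVM.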

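Creating custom charsets:

At a minimum, you must create a subclass of java.nio.charset.Charset, provide concrete implementations of its three abstract methods, and provide a constructor. The Charset class has no default, no-argument constructor. This means that your custom charset class must have a constructor, even if it takes no arguments, because you must invoke Charset's constructor at instantiation time (by calling super() at the start of your constructor) and pass it the canonical name and aliases of your charset. Doing so lets the methods of the Charset class handle the name-related work for you, which is a good thing.

Likewise, you need to provide concrete implementations of CharsetEncoder and CharsetDecoder. Recall that a charset is a set of coded characters together with an encoding/decoding scheme. As we have seen, encoding and decoding are nearly symmetrical at the API level. What follows is a brief discussion of what is needed to implement an encoder; the same applies to building a decoder.

Like Charset, CharsetEncoder has no default constructor, so you need to call super() in the constructor of your concrete class and pass the required parameters.

To supply your own CharsetEncoder implementation, you must at a minimum provide a concrete encodeLoop() method. For simple encoding algorithms, the default implementations of the other methods should work fine. Note that encodeLoop() takes arguments similar to those of encode(), without the boolean flag. The encode() method delegates the actual encoding to encodeLoop(), which only needs to concern itself with consuming characters from the CharBuffer argument and writing the encoded bytes to the supplied ByteBuffer.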
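As an illustration of the minimum described above, here is a sketch of a toy charset that maps chars 0 to 127 straight to bytes. Everything in it is hypothetical: the class name `SevenBitCharset`, the charset name "X-SEVENBIT-DEMO", and the choice of a 7-bit mapping. It is not how the JDK implements US-ASCII.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CoderResult;

public class SevenBitCharset extends Charset {
    public SevenBitCharset() {
        // Unregistered charset, so the canonical name starts with "X-"
        super("X-SEVENBIT-DEMO", null);
    }

    public boolean contains(Charset cs) {
        return cs instanceof SevenBitCharset;
    }

    public CharsetEncoder newEncoder() {
        // One byte per char on average and at most
        return new CharsetEncoder(this, 1.0f, 1.0f) {
            protected CoderResult encodeLoop(CharBuffer in, ByteBuffer out) {
                while (in.hasRemaining()) {
                    if (!out.hasRemaining()) {
                        return CoderResult.OVERFLOW;
                    }
                    // Peek without consuming, so that on error the input
                    // position still points at the offending char
                    char c = in.get(in.position());
                    if (c > 0x7f) {
                        return CoderResult.unmappableForLength(1);
                    }
                    out.put((byte) c);
                    in.position(in.position() + 1);
                }
                return CoderResult.UNDERFLOW;
            }
        };
    }

    public CharsetDecoder newDecoder() {
        return new CharsetDecoder(this, 1.0f, 1.0f) {
            protected CoderResult decodeLoop(ByteBuffer in, CharBuffer out) {
                while (in.hasRemaining()) {
                    if (!out.hasRemaining()) {
                        return CoderResult.OVERFLOW;
                    }
                    byte b = in.get(in.position());
                    if (b < 0) { // high bit set: not valid in this charset
                        return CoderResult.malformedForLength(1);
                    }
                    out.put((char) b);
                    in.position(in.position() + 1);
                }
                return CoderResult.UNDERFLOW;
            }
        };
    }

    public static void main(String[] args) {
        Charset cs = new SevenBitCharset();
        ByteBuffer bytes = cs.encode("a\u00f1b");
        System.out.println(cs.decode(bytes));
    }
}
```

The inherited encode() and decode() methods take care of error actions and replacement; the two loop methods only move data and report the four result conditions.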


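Now that we have seen how to implement custom charsets, including the associated encoders and decoders, let's look at how to hook them into the JVM so that running code can use them.


Providing your custom charsets:

 To supply your own Charset implementations to the JVM runtime environment, you must create concrete subclasses of the CharsetProvider class in java.nio.charset.spi, each with a no-argument constructor. The no-argument constructor is important because your CharsetProvider class will be located by reading its fully qualified name from a configuration file. That class name string is then passed to Class.newInstance() to instantiate your provider, which works only through the no-argument constructor.

The configuration file that the JVM reads to locate charset providers is named java.nio.charset.spi.CharsetProvider. It lives in a resource directory (META-INF/services) on the JVM classpath. Every Java Archive (JAR) file has a META-INF directory that can hold information about the classes and resources in that JAR. A directory named META-INF can also be placed at the top of a regular directory on the JVM classpath.

The CharsetProvider API is almost trivial. The real work of providing custom charsets lies in creating the custom Charset, CharsetEncoder, and CharsetDecoder classes. CharsetProvider is merely a facilitator that connects your charsets to the runtime environment.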
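Whether a provider has actually been picked up can be checked at run time. The sketch below assumes a services file named META-INF/services/java.nio.charset.spi.CharsetProvider that contains the provider's fully qualified class name on a line by itself; `ProviderLookup` and `lookupOrFallback` are illustrative names.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ProviderLookup {
    /**
     * Returns the named charset if the JVM (or an installed CharsetProvider)
     * supports it, otherwise the fallback. An illegal name, such as one that
     * contains a space, raises IllegalCharsetNameException.
     */
    static Charset lookupOrFallback(String name, Charset fallback) {
        return Charset.isSupported(name) ? Charset.forName(name) : fallback;
    }

    public static void main(String[] args) {
        // Prints X-ROT13 only if a provider for it is on the classpath
        Charset cs = lookupOrFallback("X-ROT13", StandardCharsets.UTF_8);
        System.out.println("Using charset: " + cs.name());
    }
}
```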


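Examples 6-3 and 6-4 demonstrate the implementation of a custom Charset and CharsetProvider, including sample code that illustrates charset use, encoding and decoding, and the Charset SPI. Example 6-3 implements a custom Charset.

 Example 6-3. The custom Rot13 charset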

    package com.ronsoft.books.nio.charset; 

    import java.nio.CharBuffer; 
    import java.nio.ByteBuffer; 
    import java.nio.charset.Charset; 
    import java.nio.charset.CharsetEncoder; 
    import java.nio.charset.CharsetDecoder; 
    import java.nio.charset.CoderResult; 
    import java.util.Map; 
    import java.util.Iterator; 
    import java.io.Writer; 
    import java.io.PrintStream; 
    import java.io.PrintWriter; 
    import java.io.OutputStreamWriter; 
    import java.io.BufferedReader; 
    import java.io.InputStreamReader; 
    import java.io.FileReader; 

    /**
     * A Charset implementation which performs Rot13 encoding. Rot-13 encoding is a
     * simple text obfuscation algorithm which shifts alphabetical characters by 13
     * so that 'a' becomes 'n', 'o' becomes 'b', etc. This algorithm was popularized
     * by the Usenet discussion forums many years ago to mask naughty words, hide
     * answers to questions, and so on. The Rot13 algorithm is symmetrical, applying
     * it to text that has been scrambled by Rot13 will give you the original
     * unscrambled text.
     * 
     * Applying this Charset encoding to an output stream will cause everything you
     * write to that stream to be Rot13 scrambled as it's written out. And applying
     * it to an input stream causes data read to be Rot13 descrambled as it's read.
     * 
     * @author Ron Hitchens ([email protected])
     */ 
    public class Rot13Charset extends Charset { 
        // the name of the base charset encoding we delegate to 
        private static final String BASE_CHARSET_NAME = "UTF-8"; 
        // Handle to the real charset we'll use for transcoding between 
        // characters and bytes. Doing this allows us to apply the Rot13 
        // algorithm to multibyte charset encodings. But only the 
        // ASCII alpha chars will be rotated, regardless of the base encoding. 
        Charset baseCharset; 

        /**
         * Constructor for the Rot13 charset. Call the superclass constructor to
         * pass along the name(s) we'll be known by. Then save a reference to the
         * delegate Charset.
         */ 
        protected Rot13Charset(String canonical, String[] aliases) { 
            super(canonical, aliases); 
            // Save the base charset we're delegating to 
            baseCharset = Charset.forName(BASE_CHARSET_NAME); 
        } 

        // ---------------------------------------------------------- 
        /**
         * Called by users of this Charset to obtain an encoder. This implementation
         * instantiates an instance of a private class (defined below) and passes it
         * an encoder from the base Charset.
         */ 
        public CharsetEncoder newEncoder() { 
            return new Rot13Encoder(this, baseCharset.newEncoder()); 
        } 

        /**
         * Called by users of this Charset to obtain a decoder. This implementation
         * instantiates an instance of a private class (defined below) and passes it
         * a decoder from the base Charset.
         */ 
        public CharsetDecoder newDecoder() { 
            return new Rot13Decoder(this, baseCharset.newDecoder()); 
        } 

        /**
         * This method must be implemented by concrete Charsets. We always say no,
         * which is safe.
         */ 
        public boolean contains(Charset cs) { 
            return (false); 
        } 

        /**
         * Common routine to rotate all the ASCII alpha chars in the given
         * CharBuffer by 13. Note that this code explicitly compares for upper and
         * lower case ASCII chars rather than using the methods
         * Character.isLowerCase and Character.isUpperCase. This is because the
         * rotate-by-13 scheme only works properly for the alphabetic characters of
         * the ASCII charset and those methods can return true for non-ASCII Unicode
         * chars.
         */ 
        private void rot13(CharBuffer cb) { 
            for (int pos = cb.position(); pos < cb.limit(); pos++) { 
                char c = cb.get(pos); 
                char a = '\u0000'; 
                // Is it lowercase alpha? 
                if ((c >= 'a') && (c <= 'z')) { 
                    a = 'a'; 
                } 
                // Is it uppercase alpha? 
                if ((c >= 'A') && (c <= 'Z')) { 
                    a = 'A'; 
                } 
                // If either, roll it by 13 
                if (a != '\u0000') { 
                    c = (char) ((((c - a) + 13) % 26) + a); 
                    cb.put(pos, c); 
                } 
            } 
        } 

        // -------------------------------------------------------- 
        /**
         * The encoder implementation for the Rot13 Charset. This class, and the
         * matching decoder class below, should also override the "impl" methods,
         * such as implOnMalformedInput( ) and make passthrough calls to the
         * baseEncoder object. That is left as an exercise for the hacker.
         */ 
        private class Rot13Encoder extends CharsetEncoder { 
            private CharsetEncoder baseEncoder; 

            /**
             * Constructor, call the superclass constructor with the Charset object
             * and the encodings sizes from the delegate encoder.
             */ 
            Rot13Encoder(Charset cs, CharsetEncoder baseEncoder) { 
                super(cs, baseEncoder.averageBytesPerChar(), baseEncoder 
                        .maxBytesPerChar()); 
                this.baseEncoder = baseEncoder; 
            } 

            /**
             * Implementation of the encoding loop. First, we apply the Rot13
             * scrambling algorithm to the CharBuffer, then reset the encoder for
             * the base Charset and call its encode( ) method to do the actual
             * encoding. This may not work properly for non-Latin charsets. The
             * CharBuffer passed in may be read-only or re-used by the caller for
             * other purposes so we duplicate it and apply the Rot13 encoding to the
             * copy. We DO want to advance the position of the input buffer to
             * reflect the chars consumed.
             */ 
            protected CoderResult encodeLoop(CharBuffer cb, ByteBuffer bb) { 
                CharBuffer tmpcb = CharBuffer.allocate(cb.remaining()); 
                while (cb.hasRemaining()) { 
                    tmpcb.put(cb.get()); 
                } 
                tmpcb.rewind(); 
                rot13(tmpcb); 
                baseEncoder.reset(); 
                CoderResult cr = baseEncoder.encode(tmpcb, bb, true); 
                // If error or output overflow, we need to adjust 
                // the position of the input buffer to match what 
                // was really consumed from the temp buffer. If 
                // underflow (all input consumed), this is a no-op. 
                cb.position(cb.position() - tmpcb.remaining()); 
                return (cr); 
            } 
        } 

        // -------------------------------------------------------- 
        /**
         * The decoder implementation for the Rot13 Charset.
         */ 
        private class Rot13Decoder extends CharsetDecoder { 
            private CharsetDecoder baseDecoder; 

            /**
             * Constructor, call the superclass constructor with the Charset object
             * and pass along the chars/byte values from the delegate decoder.
             */ 
            Rot13Decoder(Charset cs, CharsetDecoder baseDecoder) { 
                super(cs, baseDecoder.averageCharsPerByte(), baseDecoder 
                        .maxCharsPerByte()); 
                this.baseDecoder = baseDecoder; 
            } 

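            /**
             * Implementation of the decoding loop. First, we reset the decoder for
             * the base charset, then call it to decode the bytes into characters,
             * saving the result code. The chars just written to the CharBuffer are
             * then de-scrambled with the Rot13 algorithm and the result code is
             * returned. This may not work properly for non-Latin charsets.
             */ 
            protected CoderResult decodeLoop(ByteBuffer bb, CharBuffer cb) { 
                // Remember where the decoded chars will start 
                int start = cb.position(); 
                baseDecoder.reset(); 
                CoderResult result = baseDecoder.decode(bb, cb, true); 
                // rot13() works on [position, limit), but decode() has already 
                // advanced cb past the new chars, so rotate a view of that range 
                CharBuffer decoded = cb.duplicate(); 
                decoded.limit(cb.position()); 
                decoded.position(start); 
                rot13(decoded); 
                return (result); 
            } 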
        } 

        // -------------------------------------------------------- 
        /**
         * Unit test for the Rot13 Charset. This main( ) will open and read an input
         * file if named on the command line, or stdin if no args are provided, and
         * write the contents to stdout via the X-ROT13 charset encoding. The
         * "encryption" implemented by the Rot13 algorithm is symmetrical. Feeding
         * in a plain-text file, such as Java source code for example, will output a
         * scrambled version. Feeding the scrambled version back in will yield the
         * original plain-text document.
         */ 
        public static void main(String[] argv) throws Exception { 
            BufferedReader in; 
            if (argv.length > 0) { 
                // Open the named file 
                in = new BufferedReader(new FileReader(argv[0])); 
            } else { 
                // Wrap a BufferedReader around stdin 
                in = new BufferedReader(new InputStreamReader(System.in)); 
            } 
            // Create a PrintStream that uses the Rot13 encoding 
            PrintStream out = new PrintStream(System.out, false, "X-ROT13"); 
            String s = null; 
            // Read all input and write it to the output. 
            // As the data passes through the PrintStream, 
            // it will be Rot13-encoded. 
            while ((s = in.readLine()) != null) { 
                out.println(s); 
            } 
            out.flush(); 
        } 
    } 

To use this Charset and its encoder and decoder, it must be made available to the Java runtime environment. This is done with a CharsetProvider class (Example 6-4).

Example 6-4. Custom charset provider

    package com.ronsoft.books.nio.charset; 

    import java.nio.charset.Charset; 
    import java.nio.charset.spi.CharsetProvider; 
    import java.util.HashSet; 
    import java.util.Iterator; 

    /**
     * A CharsetProvider class which makes available the charsets provided by
     * Ronsoft. Currently there is only one, namely the X-ROT13 charset. This is
     * not a registered IANA charset, so its name begins with "X-" to avoid name
     * clashes with official charsets.
     * 
     * To activate this CharsetProvider, it's necessary to add a file to the
     * classpath of the JVM runtime at the following location:
     * META-INF/services/java.nio.charset.spi.CharsetProvider
     * 
     * That file must contain a line with the fully qualified name of this class on
     * a line by itself: com.ronsoft.books.nio.charset.RonsoftCharsetProvider
     * 
     * See the javadoc page for java.nio.charset.spi.CharsetProvider for full
     * details.
     * 
     * @author Ron Hitchens ([email protected])
     */ 
    public class RonsoftCharsetProvider extends CharsetProvider { 
        // the name of the charset we provide 
        private static final String CHARSET_NAME = "X-ROT13"; 
        // a handle to the Charset object 
        private Charset rot13 = null; 

        /**
         * Constructor, instantiate a Charset object and save the reference.
         */ 
        public RonsoftCharsetProvider() { 
            this.rot13 = new Rot13Charset(CHARSET_NAME, new String[0]); 
        } 

        /**
         * Called by Charset static methods to find a particular named Charset. If
         * it's the name of this charset (we don't have any aliases) then return the
         * Rot13 Charset, else return null.
         */ 
        public Charset charsetForName(String charsetName) { 
            if (charsetName.equalsIgnoreCase(CHARSET_NAME)) { 
                return (rot13); 
            } 
            return (null); 
        } 

        /**
         * Return an Iterator over the set of Charset objects we provide.
         * 
         * @return An Iterator object containing references to all the Charset
         *         objects provided by this class.
         */ 
        public Iterator<Charset> charsets() { 
            HashSet<Charset> set = new HashSet<Charset>(1); 
            set.add(rot13); 
            return (set.iterator()); 
        } 
    } 

For the JVM runtime to see this charset provider, a file named META-INF/services/java.nio.charset.spi.CharsetProvider must exist in one of the JARs or in a directory on the classpath. The content of that file must be:
 com.ronsoft.books.nio.charset.RonsoftCharsetProvider
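
One way to check whether the provider was actually picked up is to ask the Charset class at runtime. The following is a minimal sketch, not part of the original example: the class name CharsetLookupDemo and the helper lookup() are hypothetical, and falling back to UTF-8 is an assumption made only for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper, not part of the original example. */
final class CharsetLookupDemo {

    /**
     * Returns the named charset if the JVM can find it (built in, or supplied
     * by a CharsetProvider visible on the classpath), otherwise the fallback.
     */
    static Charset lookup(String name, Charset fallback) {
        // isSupported() returns false for unknown names; it throws
        // IllegalCharsetNameException only for syntactically illegal names.
        return Charset.isSupported(name) ? Charset.forName(name) : fallback;
    }

    public static void main(String[] args) {
        // Without the services file on the classpath this falls back to UTF-8.
        Charset cs = lookup("X-ROT13", StandardCharsets.UTF_8);
        System.out.println("Using charset: " + cs.name());

        // availableCharsets() also includes charsets contributed by providers.
        boolean listed = Charset.availableCharsets().containsKey("X-ROT13");
        System.out.println("Listed by availableCharsets(): " + listed);
    }
}
```

If the services file is missing or misnamed, the lookup quietly falls back instead of throwing UnsupportedCharsetException, which makes a misconfigured classpath easy to spot from the printed name.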

Adding X-ROT13 to the list of charsets in Example 6-1 produces this additional output:

    Charset: X-ROT13  
      Input:   ¿Mañana?   
    Encoded:     
       0: c2 (Â)  
       1: bf (¿)  
       2: 5a (Z)  
       3: 6e (n)  
       4: c3 (Ã)  
       5: b1 (±)  
       6: 6e (n)  
       7: 61 (a)  
       8: 6e (n)  
       9: 3f (?)  
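
The byte values in this listing can be reproduced without the custom charset. The sketch below assumes that X-ROT13 rotates only the ASCII letters and delegates the byte encoding to UTF-8, which is what the c2 bf and c3 b1 pairs in the output suggest. The class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical stand-alone check, not part of the original example. */
final class Rot13BytesDemo {

    /** Rotates ASCII letters by 13 places; all other characters pass through. */
    static String rot13(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c >= 'a' && c <= 'z') {
                c = (char) ('a' + (c - 'a' + 13) % 26);
            } else if (c >= 'A' && c <= 'Z') {
                c = (char) ('A' + (c - 'A' + 13) % 26);
            }
            sb.append(c);
        }
        return sb.toString();
    }

    /** Scrambles the text, encodes it as UTF-8, and formats the bytes as hex. */
    static String encodeToHex(String s) {
        byte[] bytes = rot13(s).getBytes(StandardCharsets.UTF_8);
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // \u00bf is the inverted question mark, \u00f1 is n with tilde.
        // prints "c2 bf 5a 6e c3 b1 6e 61 6e 3f"
        System.out.println(encodeToHex("\u00bfMa\u00f1ana?"));
    }
}
```

Only the letters M, a, n, a, n, a change (to Z, n, a, n, a, n). The two non-ASCII characters pass through the rotation untouched and each become a two-byte UTF-8 sequence.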

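Summary: Many Java programmers will never need to deal with character set transcoding, and most will never create a custom charset. For those who do, the classes in java.nio.charset and java.nio.charset.spi provide a powerful and flexible mechanism for character handling.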

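Charset (character set class)
         Encapsulates a coded character set together with the character-encoding scheme used to represent sequences of characters from that set as byte sequences.

CharsetEncoder (charset encoder class)
         An encoding engine that converts a sequence of characters into a sequence of bytes. The byte sequence can later be decoded to reconstruct the original character sequence.

CharsetDecoder (charset decoder class)
         A decoding engine that converts an encoded byte sequence into a sequence of characters.

CharsetProvider SPI (charset provider SPI)
         Locates Charset implementations through the service provider mechanism and makes them available for use in the runtime environment.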
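
How the first three classes fit together can be seen in a short round trip. This is a minimal sketch using only standard JDK charsets. The class name RoundTripDemo and the choice to report coding errors rather than substitute replacement characters are assumptions made for illustration.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

/** Hypothetical demo class, not part of the original article. */
final class RoundTripDemo {

    /** Encodes text to bytes with the given charset, then decodes it back. */
    static String roundTrip(String text, Charset cs) throws CharacterCodingException {
        // REPORT makes unencodable or malformed input raise an exception
        // instead of being silently replaced.
        CharsetEncoder encoder = cs.newEncoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        CharsetDecoder decoder = cs.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);

        ByteBuffer bytes = encoder.encode(CharBuffer.wrap(text));
        CharBuffer chars = decoder.decode(bytes);
        return chars.toString();
    }

    public static void main(String[] args) throws CharacterCodingException {
        String text = "\u00bfMa\u00f1ana?";

        // UTF-8 can represent every character, so the text survives intact.
        System.out.println(roundTrip(text, StandardCharsets.UTF_8));

        // US-ASCII has no mapping for the two non-ASCII characters.
        try {
            roundTrip(text, StandardCharsets.US_ASCII);
        } catch (CharacterCodingException e) {
            System.out.println("US-ASCII cannot represent the text: " + e);
        }
    }
}
```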
