Tuning Java I/O Performance

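This article discusses and illustrates a variety of techniques for improving Java I/O performance. Most of the techniques center on tuning disk file I/O, but some also apply to network I/O and window output. The first set of techniques covers low-level I/O issues; higher-level issues such as compression, formatting, and serialization follow. The discussion does not cover application design issues, such as the choice of search algorithms and data structures, nor system-level issues such as file caching.

When discussing Java I/O, it is worth noting that the Java language assumes two distinct kinds of disk file organization: one based on streams of bytes, the other on sequences of characters. In the Java language a character is represented by two bytes, not by one byte as in languages such as C. Because of this, some translation is required when reading characters from a file. The distinction matters in some contexts, as several of the examples below illustrate.

Low-Level I/O Issues
- Basic Rules for Speeding Up I/O
- Buffering
- Reading and Writing Text Files
- Formatting Costs
- Random Access

High-Level I/O Issues
- Compression
- Caching
- Tokenization
- Serialization
- Obtaining Information About Files
- Further Information

Basic Rules for Speeding Up I/O

As a starting point, here are some basic rules for speeding up I/O:

- Avoid accessing the disk.
- Avoid accessing the underlying operating system.
- Avoid method calls.
- Avoid processing bytes and characters individually.

Clearly these rules cannot be applied across the board, because if they were, no I/O would ever get done. Consider the following three-part example, which counts the newline bytes ('\n') in a file.

Approach 1: Read Method

The first approach simply uses the read method of FileInputStream:

    import java.io.*;

    public class intro1 {
        public static void main(String args[]) {
            if (args.length != 1) {
                System.err.println("missing filename");
                System.exit(1);
            }
            try {
                FileInputStream fis = new FileInputStream(args[0]);
                int cnt = 0;
                int b;
                // One call into the runtime system per byte read
                while ((b = fis.read()) != -1) {
                    if (b == '\n')
                        cnt++;
                }
                fis.close();
                System.out.println(cnt);
            } catch (IOException e) {
                System.err.println(e);
            }
        }
    }

This approach, however, triggers a large number of calls into the underlying runtime system: FileInputStream.read is a native method that returns the next byte of the file.

Approach 2: Using a Large Buffer

The second approach avoids this problem by using a large buffer:

    import java.io.*;

    public class intro2 {
        public static void main(String args[]) {
            if (args.length != 1) {
                System.err.println("missing filename");
                System.exit(1);
            }
            try {
                FileInputStream fis = new FileInputStream(args[0]);
                BufferedInputStream bis = new BufferedInputStream(fis);
                int cnt = 0;
                int b;
                // Each read is served from the in-memory buffer
                while ((b = bis.read()) != -1) {
                    if (b == '\n')
                        cnt++;
                }
                bis.close();
                System.out.println(cnt);
            } catch (IOException e) {
                System.err.println(e);
            }
        }
    }

BufferedInputStream.read takes the next byte from the input buffer and accesses the underlying system only occasionally, when the buffer has to be refilled.

Approach 3: Direct Buffering

The third approach avoids BufferedInputStream and does the buffering directly, which eliminates the read method calls:

    import java.io.*;

    public class intro3 {
        public static void main(String args[]) {
            if (args.length != 1) {
                System.err.println("missing filename");
                System.exit(1);
            }
            try {
                FileInputStream fis = new FileInputStream(args[0]);
                byte buf[] = new byte[2048];
                int cnt = 0;
                int n;
                // Read a block at a time, then scan the block in memory
                while ((n = fis.read(buf)) != -1) {
                    for (int i = 0; i < n; i++) {
                        if (buf[i] == '\n')
                            cnt++;
                    }
                }
                fis.close();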
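                System.out.println(cnt);
            } catch (IOException e) {
                System.err.println(e);
            }
        }
    }

For a 1 MB input file, the execution times in seconds are:

    intro1    6.9
    intro2    0.9
    intro3    0.4

That is a difference of about 17 to 1 between the slowest and the fastest approach.

This large speedup does not prove that you should always take the third approach and do your own buffering. Doing so can be error-prone, especially in the handling of end-of-file, if it is not implemented carefully. It is also less readable than the alternatives. But it is useful to keep in mind where the time goes and how the problem can be corrected when necessary. Approach 2 is probably the "right" approach for most applications.

Buffering

Approaches 2 and 3 use the technique of buffering: large chunks of a file are read from disk and then accessed a byte or a character at a time. Buffering is a basic and important technique for speeding up I/O, and several classes support it (BufferedInputStream for bytes, BufferedReader for characters).

An obvious question is: does making the buffer bigger make I/O faster? Java buffers are typically 1024 or 2048 bytes long. A larger buffer may speed up I/O, but only by a small amount, perhaps 5 to 10%.

Approach 4: Whole File

The extreme case of buffering is to determine the length of the file in advance and then read the whole file in:

    import java.io.*;

    public class readfile {
        public static void main(String args[]) {
            if (args.length != 1) {
                System.err.println("missing filename");
                System.exit(1);
            }
            try {
                int len = (int)(new File(args[0]).length());
                FileInputStream fis = new FileInputStream(args[0]);
                byte buf[] = new byte[len];
                // Note: a single read call is not guaranteed to fill buf
                fis.read(buf);
                fis.close();
                int cnt = 0;
                for (int i = 0; i < len; i++) {
                    if (buf[i] == '\n')
                        cnt++;
                }
                System.out.println(cnt);
            } catch (IOException e) {
                System.err.println(e);
            }
        }
    }

This approach is convenient, because the file can be treated as an array of bytes. The obvious problem is that there may not be enough memory to read in a very large file.

Another aspect of buffering concerns text output to a terminal window. By default, System.out (a PrintStream) is line buffered, meaning that the output buffer is flushed whenever a newline character is encountered. This matters for interactivity, where you want an input prompt to be displayed before any input is actually entered.

Approach 5: Disabling Line Buffering

Line buffering can be disabled, as in the following example:

    import java.io.*;

    public class bufout {
        public static void main(String args[]) {
            FileOutputStream fdout = new FileOutputStream(FileDescriptor.out);
            BufferedOutputStream bos = new BufferedOutputStream(fdout, 1024);
            // false: do not flush automatically on newline
            PrintStream ps = new PrintStream(bos, false);
            System.setOut(ps);
            final int N = 100000;
            for (int i = 1; i <= N; i++)
                System.out.println(i);
            ps.close();
        }
    }

This program writes the integers 1 to 100000 to the standard output, and runs about three times faster than it does with the default line buffering.

Buffering is also an important part of an example shown below, where a buffer is used to speed up random file access.

Reading and Writing Text Files

It was mentioned earlier that method call overhead can be significant when reading characters from a file. The same issue shows up in another example, a program that counts the lines in a text file:

    import java.io.*;

    public class line1 {
        public static void main(String args[]) {
            if (args.length != 1) {
                System.err.println("missing filename");
                System.exit(1);
            }
            try {
                FileInputStream fis = new FileInputStream(args[0]);
                BufferedInputStream bis = new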
                    BufferedInputStream(fis);
                DataInputStream dis = new DataInputStream(bis);
                int cnt = 0;
                // DataInputStream.readLine is deprecated; see below
                while (dis.readLine() != null)
                    cnt++;
                dis.close();
                System.out.println(cnt);
            } catch (IOException e) {
                System.err.println(e);
            }
        }
    }

This program uses the old DataInputStream.readLine method, which is implemented with a read method call for each character. A newer approach is:

    import java.io.*;

    public class line2 {
        public static void main(String args[]) {
            if (args.length != 1) {
                System.err.println("missing filename");
                System.exit(1);
            }
            try {
                FileReader fr = new FileReader(args[0]);
                BufferedReader br = new BufferedReader(fr);
                int cnt = 0;
                while (br.readLine() != null)
                    cnt++;
                br.close();
                System.out.println(cnt);
            } catch (IOException e) {
                System.err.println(e);
            }
        }
    }

This approach is faster. For example, on a 6 MB text file with 200,000 lines, the second program is about 20% faster than the first.

But even if the second program were no faster, the first program has an important problem to note. It triggers a deprecation warning under the Java 2 compiler, because DataInputStream.readLine is obsolete. It does not properly convert bytes to characters, so it can be a poor choice for text files that contain anything other than ASCII characters. (The Java language uses the Unicode character set, not ASCII.)

This is where the distinction between byte streams and character streams mentioned earlier comes in. A program such as this one:

    import java.io.*;

    public class conv1 {
        public static void main(String args[]) {
            try {
                FileOutputStream fos = new FileOutputStream("out1");
                PrintStream ps = new PrintStream(fos);
                ps.println("\uffff\u4321\u1234");
                ps.close();
            } catch (IOException e) {
                System.err.println(e);
            }
        }
    }

writes to a file, but does not preserve the actual Unicode characters that were written. The Reader/Writer I/O classes are character-based and are designed to solve this problem. OutputStreamWriter is where the encoding of characters into bytes is applied.

A program that uses PrintWriter to write out Unicode characters looks like this:

    import java.io.*;

    public class conv2 {
        public static void main(String args[]) {
            try {
                FileOutputStream fos = new FileOutputStream("out2");
                OutputStreamWriter osw = new OutputStreamWriter(fos, "UTF8");
                PrintWriter pw = new PrintWriter(osw);
                pw.println("\uffff\u4321\u1234");
                pw.close();
            } catch (IOException e) {
                System.err.println(e);
            }
        }
    }

This program uses the UTF8 encoding, which has the property of encoding ASCII text as itself and other characters as two or three bytes.

Formatting Costs

Actually writing data to a file is only part of the cost of output. Another significant cost is data formatting. Consider a three-part example that writes out a line like this:

    The square of 5 is 25

Approach 1

The first approach simply writes out a fixed string, to give an idea of the intrinsic I/O cost:

    public class format1 {
        public static void main(String args[]) {
            final int COUNT = 25000;
            for (int i = 1; i <= COUNT; i++) {
                String s = "The square of 5 is 25\n";
                System.out.print(s);
            }
        }
    }

Approach 2

The second approach uses simple formatting with "+":

    public class format2 {
        public static void main(String args[]) {
            int n = 5;
            final int COUNT = 25000;
            for (int i = 1; i <= COUNT; i++) {
                String s = "The square of " + n + " is " + n * n + "\n";
                System.out.print(s);
            }
        }
    }

Approach 3

The third approach uses the MessageFormat class from the java.text package:

    import java.text.*;

    public class format3 {
        public static void main(String args[]) {
            // The format is compiled once, outside the loop
            MessageFormat fmt =
                new MessageFormat("The square of {0} is {1}\n");
            Object values[] = new Object[2];
            int n = 5;
            values[0] = Integer.valueOf(n);
            values[1] = Integer.valueOf(n * n);
            final int COUNT = 25000;
            for (int i = 1; i <= COUNT; i++) {
                String s = fmt.format(values);
                System.out.print(s);
            }
        }
    }

These programs produce identical output. The running times in seconds are:

    format1    1.3
    format2    1.8
    format3    7.8

That is a difference of about 6 to 1 between the slowest and the fastest. The third approach would be slower still if the format were not precompiled and the static convenience method MessageFormat.format(String, Object[]) were used instead:

Approach 4

    import java.text.*;

    public class format4 {
        public static void main(String args[]) {
            String fmt = "The square of {0} is {1}\n";
            Object values[] = new Object[2];
            int n = 5;
            values[0] = Integer.valueOf(n);
            values[1] = Integer.valueOf(n * n);
            final int COUNT = 25000;
            for (int i = 1; i <= COUNT; i++) {
                // The format string is parsed again on every call
                String s = MessageFormat.format(fmt, values);
                System.out.print(s);
            }
        }
    }

This takes about one-third longer than the previous example.

The fact that the third approach is much slower than the first two does not mean that you should not use it, only that you should be aware of its cost in time. Message formats are very important in internationalization contexts, and an application concerned with this issue would typically read the format from a resource bundle and then use it.

Random Access

RandomAccessFile is a class for doing random file I/O at the byte level. The class provides a seek method, similar to the one found in C/C++, that moves the file pointer to an arbitrary location, from which bytes can then be read or written.

The seek method accesses the underlying runtime system and therefore tends to be expensive. A cheaper alternative is to set up your own buffering on top of a RandomAccessFile and implement a direct read method for bytes. The parameter to read is the byte offset (>= 0) of the desired byte. An example of this is:

    import java.io.*;

    public class ReadRandom {
        private static final int DEFAULT_BUFSIZE = 4096;

        private RandomAccessFile raf;
        private byte inbuf[];
        private long startpos = -1;   // file offset of inbuf[0]
        private long endpos = -1;     // file offset of last valid byte in inbuf
        private int bufsize;

        public ReadRandom(String name) throws FileNotFoundException {
            this(name, DEFAULT_BUFSIZE);
        }

        public ReadRandom(String name, int b) throws FileNotFoundException {
            raf = new RandomAccessFile(name, "r");
            bufsize = b;
            inbuf = new byte[bufsize];
        }

        // Returns the byte at offset pos as 0..255, or -1 at end of file
        public int read(long pos) {
            if (pos < startpos || pos > endpos) {
                // Buffer miss: load the block that contains pos
                long blockstart = (pos / bufsize) * bufsize;
                int n;
                try {
                    raf.seek(blockstart);
                    n = raf.read(inbuf);
                } catch (IOException e) {
                    return -1;
                }
                startpos = blockstart;
                endpos = blockstart + n - 1;
                if (pos < startpos || pos > endpos)
                    return -1;
            }
            // Mask to an unsigned byte value so that it never collides with -1
            return inbuf[(int)(pos - startpos)] & 0xff;
        }

        public void close() throws IOException {
            raf.close();
        }

        public static void main(String args[]) {
            if (args.length != 1) {
                System.err.println("missing filename");
                System.exit(1);
            }
            try {
                ReadRandom rr = new ReadRandom(args[0]);
                long pos = 0;
                int c;
                byte buf[] = new byte[1];
                while ((c = rr.read(pos)) != -1) {
                    pos++;
                    buf[0] = (byte)c;
                    System.out.write(buf, 0, 1);
                }
                rr.close();
            } catch (IOException e) {
                System.err.println(e);
            }
        }
    }

This driver program simply reads the bytes in sequence and writes them out.

The technique is helpful when there is locality of access, that is, when nearby bytes in the file are read at about the same time. For example, if you are implementing a binary search on a sorted file, this approach may be useful. It is of less value if you are truly doing random access at arbitrary points in a large file.

Compression

Java provides classes for compressing and uncompressing byte streams. They are found in the java.util.zip package and also serve as the basis for Jar files (a Jar file is a Zip file with an added manifest).

The following program takes a single input file and writes it to a compressed Zip file containing one entry:

    import java.io.*;
    import java.util.zip.*;

    public class compress {
        public static void doit(String filein, String fileout) {
            FileInputStream fis = null;
            FileOutputStream fos = null;
            try {
                fis = new FileInputStream(filein);
                fos = new FileOutputStream(fileout);
                ZipOutputStream zos = new ZipOutputStream(fos);
                ZipEntry ze = new ZipEntry(filein);
                zos.putNextEntry(ze);
                final int BUFSIZ = 4096;
                byte inbuf[] = new byte[BUFSIZ];
                int n;
                while ((n = fis.read(inbuf)) != -1)
                    zos.write(inbuf, 0, n);
                fis.close();
                fis = null;
                zos.close();   // also closes fos
                fos = null;
            } catch (IOException e) {
                System.err.println(e);
            } finally {
                try {
                    if (fis != null)
                        fis.close();
                    if (fos != null)
                        fos.close();
                } catch (IOException e) {
                }
            }
        }

        public static void main(String args[]) {
            if (args.length != 2) {
                System.err.println("missing filenames");
                System.exit(1);
            }
            if (args[0].equals(args[1])) {
                System.err.println("filenames are identical");
                System.exit(1);
            }
            doit(args[0], args[1]);
        }
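    }

The next program reverses the process, taking an input Zip file that is assumed to contain a single entry and uncompressing that entry to the output file:

    import java.io.*;
    import java.util.zip.*;

    public class uncompress {
        public static void doit(String filein, String fileout) {
            FileInputStream fis = null;
            FileOutputStream fos = null;
            try {
                fis = new FileInputStream(filein);
                fos = new FileOutputStream(fileout);
                ZipInputStream zis = new ZipInputStream(fis);
                // Position the stream at the single expected entry
                ZipEntry ze = zis.getNextEntry();
                final int BUFSIZ = 4096;
                byte inbuf[] = new byte[BUFSIZ];
                int n;
                while ((n = zis.read(inbuf, 0, BUFSIZ)) != -1)
                    fos.write(inbuf, 0, n);
                zis.close();   // also closes fis
                fis = null;
                fos.close();
                fos = null;
            } catch (IOException e) {
                System.err.println(e);
            } finally {
                try {
                    if (fis != null)
                        fis.close();
                    if (fos != null)
                        fos.close();
                } catch (IOException e) {
                }
            }
        }

        public static void main(String args[]) {
            if (args.length != 2) {
                System.err.println("missing filenames");
                System.exit(1);
            }
            if (args[0].equals(args[1])) {
                System.err.println("filenames are identical");
                System.exit(1);
            }
            doit(args[0], args[1]);
        }
    }

Whether compression helps or hurts I/O performance depends heavily on your hardware setup, in particular on the relative speeds of the processor and the disk drives. Compression with Zip technology typically means about a 50% reduction in data size, at the cost of the time spent compressing and uncompressing. On a 300-MHz Pentium PC with IDE disk drives, a large (5 to 10 MB) compressed text file can be read from disk in about one-third less time than the uncompressed file.

A useful application of compression is writing data to very slow media such as floppy disks. A test using a fast processor (300 MHz Pentium) and a slow floppy drive (the ordinary floppy drive found on a PC) showed that compressing a large text file and then writing it to the floppy was about 50% faster than writing the file to the floppy directly.

Caching

A detailed discussion of hardware caching is beyond the scope of this article. But there are cases where software caching can be used to speed up I/O. Consider reading lines of a text file in random order. One way to do this is to read in all the lines and store them in an ArrayList (a collection class similar to Vector):

    import java.io.*;
    import java.util.ArrayList;

    public class LineCache {
        private ArrayList list = new ArrayList();

        public LineCache(String fn) throws IOException {
            FileReader fr = new FileReader(fn);
            BufferedReader br = new BufferedReader(fr);
            String ln;
            while ((ln = br.readLine()) != null)
                list.add(ln);
            br.close();
        }

        // Returns line n (0-based), or null if n is past the last line
        public String getLine(int n) {
            if (n < 0)
                throw new IllegalArgumentException();
            return (n < list.size() ?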
                    (String)list.get(n) : null);
        }

        public static void main(String args[]) {
            if (args.length != 1) {
                System.err.println("missing filename");
                System.exit(1);
            }
            try {
                LineCache lc = new LineCache(args[0]);
                int i = 0;
                String ln;
                while ((ln = lc.getLine(i++)) != null)
                    System.out.println(ln);
            } catch (IOException e) {
                System.err.println(e);
            }
        }
    }

The getLine method is then used to retrieve an arbitrary line. The technique is quite useful, but it obviously uses a lot of memory for a large file, and so has its limits. An alternative is simply to remember the 100 most recently requested lines and read from disk for any other request. That scheme works well when there is locality of access to the lines, but not so well for truly random access.

Tokenization

Tokenization is the process of breaking a sequence of bytes or characters into logical chunks such as words. Java provides the StreamTokenizer class, which operates like this:

    import java.io.*;

    public class token1 {
        public static void main(String args[]) {
            if (args.length != 1) {
                System.err.println("missing filename");
                System.exit(1);
            }
            try {
                FileReader fr = new FileReader(args[0]);
                BufferedReader br = new BufferedReader(fr);
                StreamTokenizer st = new StreamTokenizer(br);
                st.resetSyntax();
                st.wordChars('a', 'z');
                int tok;
                while ((tok = st.nextToken()) != StreamTokenizer.TT_EOF) {
                    if (tok == StreamTokenizer.TT_WORD)
                        ;   // st.sval has token
                }
                br.close();
            } catch (IOException e) {
                System.err.println(e);
            }
        }
    }

This example tokenizes lower-case words (letters a-z). If you implement the equivalent functionality yourself, it might look like this:

    import java.io.*;

    public class token2 {
        public static void main(String args[]) {
            if (args.length != 1) {
                System.err.println("missing filename");
                System.exit(1);
            }
            try {
                FileReader fr = new FileReader(args[0]);
                BufferedReader br = new BufferedReader(fr);
                int maxlen = 256;
                int currlen = 0;
                char wordbuf[] = new char[maxlen];
                int c;
                do {
                    c = br.read();
                    if (c >= 'a' && c <= 'z') {
                        if (currlen == maxlen) {
                            // Word buffer is full: grow it by 50%
                            maxlen *= 1.5;
                            char xbuf[] = new char[maxlen];
                            System.arraycopy(wordbuf, 0, xbuf, 0, currlen);
                            wordbuf = xbuf;
                        }
                        wordbuf[currlen++] = (char)c;
                    } else if (currlen > 0) {
                        String s = new String(wordbuf, 0, currlen);
                        // do something with s
                        currlen = 0;
                    }
                } while (c != -1);
                br.close();
            } catch (IOException e) {
                System.err.println(e);
            }
        }
    }

The second program runs about 20% faster than the first, at the cost of writing some tricky low-level code.

StreamTokenizer is something of a hybrid class: it reads from character-based streams (such as BufferedReader) but at the same time operates in terms of bytes, treating all characters with two-byte values (greater than 0xff)
as though they were alphabetic characters.

Serialization

Serialization converts an arbitrary Java data structure into a stream of bytes, using a standard format. For example, the following program writes out a list of random integers:

    import java.io.*;
    import java.util.*;

    public class serial1 {
        public static void main(String args[]) {
            ArrayList al = new ArrayList();
            Random rn = new Random();
            final int N = 100000;
            for (int i = 1; i <= N; i++)
                al.add(Integer.valueOf(rn.nextInt()));
            try {
                FileOutputStream fos = new FileOutputStream("test.ser");
                BufferedOutputStream bos = new BufferedOutputStream(fos);
                ObjectOutputStream oos = new ObjectOutputStream(bos);
                oos.writeObject(al);
                oos.close();
            } catch (Throwable e) {
                System.err.println(e);
            }
        }
    }

and this program reads the list back in:

    import java.io.*;
    import java.util.*;

    public class serial2 {
        public static void main(String args[]) {
            ArrayList al = null;
            try {
                FileInputStream fis = new FileInputStream("test.ser");
                BufferedInputStream bis = new BufferedInputStream(fis);
                ObjectInputStream ois = new ObjectInputStream(bis);
                al = (ArrayList)ois.readObject();
                ois.close();
            } catch (Throwable e) {
                System.err.println(e);
            }
        }
    }

Note that buffering is used to speed up the I/O operations.

Is there a faster way than serialization to write out large volumes of data and then read them back? Probably not, except in special cases. For example, suppose that you decide to write out a 64-bit long integer as text instead of as a set of 8 bytes. The maximum length of a long integer as text is about 20 characters, or 2.5 times as long as the binary representation, so this format is unlikely to be any faster. In some cases, however, such as bitmaps, a special format may be an improvement. Keep in mind that using your own scheme instead of the standard serialization scheme involves you in some tradeoffs.

Besides the actual I/O and formatting costs of serialization (using DataInputStream and DataOutputStream), there are other costs, such as the need to create new objects when deserializing.

Note also that the DataOutputStream methods can be used to develop semi-custom data formats, for example:

    import java.io.*;
    import java.util.*;

    public class binary1 {
        public static void main(String args[]) {
            try {
                FileOutputStream fos = new FileOutputStream("outdata");
                BufferedOutputStream bos = new BufferedOutputStream(fos);
                DataOutputStream dos = new DataOutputStream(bos);
                Random rn = new Random();
                final int N = 10;
                // Write the count first, then the values
                dos.writeInt(N);
                for (int i = 1; i <= N; i++) {
                    int r = rn.nextInt();
                    System.out.println(r);
                    dos.writeInt(r);
                }
                dos.close();
            } catch (IOException e) {
                System.err.println(e);
            }
        }
    }

and:

    import java.io.*;

    public class binary2 {
        public static void main(String args[]) {
            try {
                FileInputStream fis = new FileInputStream("outdata");
                BufferedInputStream bis = new BufferedInputStream(fis);
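                DataInputStream dis = new DataInputStream(bis);
                // Read the count, then that many values
                int N = dis.readInt();
                for (int i = 1; i <= N; i++) {
                    int r = dis.readInt();
                    System.out.println(r);
                }
                dis.close();
            } catch (IOException e) {
                System.err.println(e);
            }
        }
    }

These programs write 10 integers to a file and then read them back.

Obtaining Information About Files

So far the discussion has centered on input and output for individual files. But there is another aspect of speeding up I/O performance, one that relates to finding out the properties of files. For example, consider a small program that prints the length of a file:

    import java.io.*;

    public class length1 {
        public static void main(String args[]) {
            if (args.length != 1) {
                System.err.println("missing filename");
                System.exit(1);
            }
            File f = new File(args[0]);
            long len = f.length();
            System.out.println(len);
        }
    }

The Java runtime system does not itself know the length of the file, so it must query the underlying operating system to obtain this information. The same holds for other information about a file, such as whether it is a directory or when it was last modified. The File class in the java.io package provides a set of methods for querying this information. Such queries are in general expensive in time and should be used as little as possible.

Here is a longer example of querying file information, one that recursively walks the whole file system and writes out every file pathname:

    import java.io.*;

    public class roots {
        public static void visit(File f) {
            System.out.println(f);
        }

        public static void walk(File f) {
            visit(f);
            if (f.isDirectory()) {
                String list[] = f.list();
                // list() returns null if the directory cannot be read
                if (list != null) {
                    for (int i = 0; i < list.length; i++)
                        walk(new File(f, list[i]));
                }
            }
        }

        public static void main(String args[]) {
            File list[] = File.listRoots();
            for (int i = 0; i < list.length; i++) {
                if (list[i].exists())
                    walk(list[i]);
                else
                    System.err.println("not accessible: " + list[i]);
            }
        }
    }

This example uses File methods such as isDirectory and exists to navigate through the directory structure. Each file is queried exactly once for its type (plain file or directory).

Further Information

The paper "JDC Performance Tips and Firewall Tunneling Techniques" discusses some general ways of improving the performance of Java applications. Some of them address the issues covered above, while others deal with lower-level issues.

Donald Knuth's book The Art of Computer Programming, Volume 3, discusses sorting and searching algorithms, such as the use of B-trees.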
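
As a follow-up to the advice on keeping file-information queries to a minimum: on Java 7 and later, one possible way to do this is java.nio.file.Files.walkFileTree, which passes each entry's BasicFileAttributes to the visitor so that the type and size need not be queried again per file. The following is only a minimal sketch and is not part of the original article; the class name TreeStats, its fields, and the scan method are hypothetical names chosen for illustration.

```java
import java.io.IOException;
import java.nio.file.FileVisitResult;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.SimpleFileVisitor;
import java.nio.file.attribute.BasicFileAttributes;

public class TreeStats {
    // Hypothetical result holder: totals gathered in a single walk
    public long files;
    public long dirs;
    public long bytes;

    public static TreeStats scan(Path root) throws IOException {
        final TreeStats stats = new TreeStats();
        Files.walkFileTree(root, new SimpleFileVisitor<Path>() {
            @Override
            public FileVisitResult preVisitDirectory(Path dir, BasicFileAttributes attrs) {
                stats.dirs++;   // counts the root directory as well
                return FileVisitResult.CONTINUE;
            }

            @Override
            public FileVisitResult visitFile(Path file, BasicFileAttributes attrs) {
                // attrs was already read by the walker; reuse it here
                if (attrs.isRegularFile()) {
                    stats.files++;
                    stats.bytes += attrs.size();
                }
                return FileVisitResult.CONTINUE;
            }

            @Override
            public FileVisitResult visitFileFailed(Path file, IOException e) {
                // Skip unreadable entries instead of aborting the walk
                return FileVisitResult.CONTINUE;
            }
        });
        return stats;
    }

    public static void main(String[] args) throws IOException {
        Path root = Paths.get(args.length > 0 ? args[0] : ".");
        TreeStats s = scan(root);
        System.out.println(s.dirs + " directories, " + s.files
            + " files, " + s.bytes + " bytes");
    }
}
```

Run with a directory name as the only argument (the current directory is assumed if none is given). Whether this is actually faster than the File-based walk depends on the platform and file system, so it is worth measuring before adopting it.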