External Sort  External Sort

go to home page Student Projects full screen, hide local find menu Google search web for more information on this topic jump to foot of page translate this page with Babelfish by Roedy Green ©1996-2009 Canadian Mind Products
This essay does not describe an existing computer program, just one that should exist. This essay is about a suggested student project in Java programming. This essay gives a rough overview of how it might work. I have no source, object, specifications, file layouts or anything else useful to implementing this project. Everything I have to say to help you with this project is written below. I am not prepared to help you implement it; I have too many other projects of my own.

I do contract work for a living, which could include writing a program such as this. However, I don’t do people’s homework for them. That just robs them of an education.

You have my full permission to implement this project in any way you please and to keep all the profits from your endeavor.

What do you do when a sort is too big to fit in RAM? When Sun’s Collections. sort fails? You need some sort of external sort that uses intermediate disk space. I can think of four ways to implement this:

  1. Create an easy-to-use interface to an existing external sort such as the Unix sort or Opttech sort.
  2. Create a traditional external sort that sorts records made of an array of bytes, the way mainframes have done it since the 1960s. The problem is the user must convert his data to byte arrays with DataOutputStream, and specify sort keys in terms of inconvenient byte offsets, then reconstitute Objects with a DataInputStream from the bytes when the sort is done.
  3. Implement a tag sort. You create small records containing only the key and relative record number, and sort them with an internal sort. They might just fit, when the entire dataset would not.
  4. Create a sort that behaves just like Sun’s that transparently uses extra disk space when the data set is large. The advantage is the user can specify the sort with a Comparator, using Java field names and operators. The disadvantage is the intermediate files would have to be stored as serialised records. There is considerable serialization overhead, both in terms of speed and size. Further the records to be sorted have to be serializable, something not required for an internal sort. Unfortunately, to compare records, the intermediate records must be reconstituted at every stage of the sort. Even though this would be quite slow, it might be the most popular version since it could be slid in without changing the application. To convert to a byte-array sort would be a lot of work, to a primitive addressing scheme. That effort might better be spend converting to an SQL database that has external sorting built-in.
An external sort typically sorts a chunk of N records in RAM then writes it out to a file. It repeats, each time writing it out to a different file. When it has written K files (where optimum N and K are to be determined by experiment. K will likely turn out be in the range 4…30), it reuses the files, appending to the end.

Then it does a K-way merge, reading from K files and writing the merged results of the first chunk in each file to another file. You read the files with a buffered reader with buffer size B, (where again the optimum size of B is determined by experiment.) Then it merges the second chunks etc. Each merge pass reduces the number of chunks to 1/K, and makes each chunk K times as large. You repeat until all the chunk are merged into one sorted one.

How cleverly you estimate values for N, K and B will largely determine the performance of your sort.

external sorting
HTML table sorter
JASM
NIST on external sorts
on the fly compose, compile and run
sort comparator
sort visualiser

CMP homejump to top
CMP logo
feedback Please email your feedback for publication, errors, omissions, broken/redirected link reports
and suggestions to improve this page to Roedy Green : feedback email
made with CSS
HTML Checked!
ICRA ratings logo
mindprod.com IP:[65.110.21.43]
Your face IP:[38.103.63.58]
You are visitor number 11.
You can get a fresh copy of this page from: or possibly from your local J: drive (Java virtual drive/mindprod.com website mirror)
http://mindprod.com/project/externalsort.html J:\mindprod\project\externalsort.html