CObArray sorting
jeffg@iqpuucp.aa.iqsc.com Monday, February 12, 1996

I am adding about 14,000 items to a CObArray. It has to be sorted, so I am using InsertAt() to place each new item at its sorted position. When the function starts, it adds about 100 items/second, but as it nears the end it manages only 1 item/second. If a client had over 100,000 items, this would slow down so much it could take days to complete. What I am looking for is a way to sort the CObArray faster, or alternatively, something else that allows sorting yet maintains its insertion speed.
David W. Gillett -- DGILLETT@expertedge.com Tuesday, February 13, 1996

[Mini-digest: 13 responses - some sort of record, I bet ;-]

Insertion sort is an O(n^2) algorithm, a poor choice for large data volumes. It works well when data shows up in ones and twos to be added to an already-sorted list, but it works poorly for sorting an entire collection and isn't really very suitable for arrays. A better choice is something like QuickSort; the qsort() library routine should be able to do the job. [The Help I have handy is for MFC 2.5, and it indicates that the array is stored in a way which would permit this. Presumably this constitutes a documented commitment for subsequent versions.] You'll need to write a comparison routine for qsort(), and note that the array entries are actually pointers. Any decent C programming FAQ will include extensive coverage of qsort(). QuickSort averages O(n log n) (its worst case is O(n^2), but that is rarely hit in practice). For small data sets it may be slower than insertion sort, but as the data set grows, QuickSort's time doesn't grow as quickly as insertion sort's -- and that's what you want.

Dave

-----From: Tim Philip

From your description I conclude that InsertAt() is probably an O(n) operation, at least in the worst case, though I'm not certain. By sorting on every insert you are doing 14,000 x O(n). If you add all the members to the array first and then do a simple BubbleSort (it is only O(n^2), and I have a function that does it for you) you will only be doing 14,000^2. Wait a second, these are both the same. Damn. Well then, try a MergeSort ( O(n log n) ) or a QuickSort ( O(n log n) ). QuickSort is generally considered faster. Damn, really thought I had something there. OK, here is another idea: declare an array big enough to hold your data using malloc() (use realloc() if you need to grow, etc.) and use qsort() and bsearch() from the standard library.
Or (I love this stuff), use a binary tree instead of an array. This will massively improve your performance. You could add your 14,000 elements into a binary tree and then do an in-order traversal of the tree, adding the elements to a list. When you start talking about 100,000 items I think you have a problem no matter what, though. I think you are far better off going with my first idea and using a QuickSort to sort the entire list in one go. That way you can warn the user ( "this may take a while ..." ) and do it all in one fell swoop. Sorry, I had originally thought I had a simple solution.

------- Tim Philip, B.Sc., Consultant, Randco Software Corporation, 2530 Hanover Avenue, Saskatoon, SK, Canada S7J 1G1. philip@cs.usask.ca Phone: +1-306-343-3380 Fax: +1-306-343-3341

-----From: "David Ohlssen"

You don't say Win16, Win32 or NT. The 32-bit listbox does something rather neat with this - trace the code! I did a test on NT 3.5 some time ago, running on an 8 MB (believe me) 486SX33 with the first 32-bit VC version (yes, it compiled on the 8 MB machine). I pumped 250,000 random strings of 100 bytes each into a list box control with its sort flag set, and it was still going faster than 1 per second at the end. Furthermore, when I dragged the thumb fast from end to end, the list box body (20 lines) stayed in sync no matter how fast I moved the mouse. That sold me on Win32 (and MFC)! MS didn't pay me to say this, but are welcome :).

David O. Ohlssen, Davido@Commerce.CTech.ac.ZA, Cape Town, South Africa

-----From: jerasmussen@falconholm.dk

How are you finding the correct position to insert the new items? By linear search? If so, I *strongly* recommend binary search, whose search time grows *very slowly* compared to that of linear search.
This is possible since the items already in the array are sorted. If you don't know how to do a binary search, please let me know :-)

Jens-Erik Rasmussen, Falconholm International A/S, Denmark

-----From: Steini

CObArray is not the right tool to use, because the only way to search it is sequentially. You should use a binary tree with an in-order algorithm (which sorts the data automatically). However, if I had to sort that many items I would use a fast database like Btrieve to avoid blowing the heap.

Steingrimur Jonsson, Computer Scientist, stein@ismennt.is steini@sjal.is http://rvik.ismennt.is/~stein

-----From: mikeblas@interserv.com

You don't mention which version of VC++ you're using. That makes a big difference, because the memory manager changed pretty substantially between VC++ 2.x and VC++ 4.0. Oh, well--if you don't care, then I don't care. Where is all the time being spent? Is it actually in CObArray::InsertAt(), or is it in your own code trying to find where you should do the next CObArray::InsertAt()? Did you call CObArray::SetSize()? When? To what size? What do you pass for the nGrowBy parameter?

> What I am looking for is a way to sort the CObArray in a faster
> manner, or alternatively, use something else that allows sorting, yet
> is able to maintain its input speed.

It really depends on how you're using the array. Are you keeping it sorted to avoid duplicates? Maybe you should just add the elements as you get them and sort the array all at once--there are dozens of algorithm books which explain the pros and cons of different sorting techniques.

.B ekiM

-- TCHAR szDisc[] = _T("These words are my own; I do not speak for Microsoft.");

-----From: Niels Ull Jacobsen

You really need to read up on data structures. And you have to decide what you need: Are all the items inserted at the same time, or are they inserted little by little? Are items ever deleted? Can you tell (before creating the data structure) how much data you'll need to fit in?
What is the acceptable memory overhead? Do you just need to access the items sequentially, or do you have to find item N? How fast - constant time, logarithmic time, linear time? Or is it enough to be able to find and remove the first element? You have several options:

- If you know all the data beforehand, just insert it into the array unsorted and sort it with an efficient in-place sorting algorithm like quicksort. BTW, inserting will be *much* quicker if you first set the size of the array, or at least set the grow-by to a large number.
- If you need to maintain a sorted structure, inserting and deleting items at various times, *and* you need to find items relatively fast, you'll have to use a balanced tree of some sort (red-black tree, 2-3 tree or something like that).
- If you just need to access item N fast, you should probably use a hash table (a CMap).
- If you only need to access the first item and you must be able to add items relatively fast, use a heap.
- If you know most of the items beforehand and you just need to traverse the data sequentially, use a plain list and sort it with mergesort once you've added most of the items. Updating the list won't be as fast, though.

But go to the library and get a book on data structures.

-- Niels Ull Jacobsen, Kruger A/S. Everything stated herein is THE OFFICIAL POLICY of the entire Kruger group and should be taken as legally binding in every respect. Pigs will grow wings and fly.

-----From: Ken Freeman

CObArray is probably the least efficient class you could use: when you insert an item, it has to move all the later items in memory. CObList would be better, but with 14,000 items even that would be slow. I would look for a b-tree class; I think Rogue Wave sells one. Alternatively, you could keep using CObArray, *but* 1) always add objects to the end; and 2) when they are all added, use qsort() to sort the array. I'm just guessing, but I think a b-tree would be faster.
Ken

-----From: jerry_albro@tcpgate.ibi.com

Have you tried using SetSize() with a large initial value and a large grow-by factor? This would reduce the number of allocations needed, which are very expensive. If you have tried this and still no joy, then I would try using a simple array (not CObArray) of pointers, and qsort(). If needed, the array could be transferred to a CObArray and the original deleted.

Jerry Albro

-----From: Brad Wilson

An array is a horrible thing to keep sorted by insertion: every time you insert something into the array, all the items after it have to be moved. You should use CObList instead. The problem there, of course, is that if you need the random access of an array, you're still kind of out of luck. In that case, find some way to sort the objects _before_ you add them to the array, then add them in the correct order. This should speed things up quite a bit. Good luck!

Brad

-- class CBradWilson : public CWorldWatchProgrammingTeam { public: void GetInetAddr ( CString& s ) { s = "bradw@exptech.com"; } void GetE164Addr ( CString& s ) { s = "+1 (810) 620-9803"; } void GetURL ( CString& s ) { s = "http://www.exptech.com"; } void GetDisclaimer( CString& s ) { s = "All I say is fact :-p"; } };
// QOTW: "Music nowadays is merely the art of executing difficulties and in the end that which is only difficult ceases to please."

-----From: "Graham Stalker-Wilde"

Why are you using an array? Inserting into an array is never going to be a good idea (it's an O(n) operation, since every insert has to shift up all the elements above it). Use CObList (and then copy the list into an array when you're through, if you really need an array). The STL's set<> collection is probably even more efficient, but your best bet, if this is a one-off operation, is to add them in any old order and then sort them once and for all.

- Graham

-----From: David Stidolph

If you can, set the size of the array up front.
What is happening is that as you add items in the middle, a new chunk of memory is allocated for the new array, all the old contents are copied into it (leaving space for the new item), and the old memory is freed. This also has the effect of requiring enough memory for both arrays at the same time. If you cannot set the size of the array, you might want to switch to a linked list (CObList). Lastly, I believe there is a limit on the number of CObArray items (at least there was in 2.x) - 32,767 items. Does anyone know if this limit has been lifted?

David Stidolph

Gee that bastard smells. No wonder they call him Pooh. --Chris Robin.

-----From: Mats Manhav

Hi jeffg,

Take a look at the documentation for CObArray::SetSize():

> For efficiency, call SetSize to set the size of the array before using it.
> This prevents the need to reallocate and copy the array each time an item is added.

It seems you should not do your sorting with InsertAt(), since it can force a reallocation each time, and with that a copy of the complete array. Personally I think I would do this with some kind of binary tree instead. Maybe there is MFC code somewhere which handles binary trees.

Mats

-- Mats Månhav (Mats Manhav for 7-bit people), email: manhav@connectum.skurup.se WWW: http://connectum.skurup.se/~manhav Phone/Fax: +46 (0)414 243 05