FAQ - Frequently Asked Questions
Here is some information that isn't available on the Distributed Folding website. This is NOT the official FAQ for the Distributed Folding project!
Overview
Important info regarding the protein change procedure.
Making filelist.txt for foldtrajlite (the Distributed Folding client)
Addons: Distributed Folding GUI
Addons: Foldmonitor
Addons: KDFold
Running Hidden/Win9x auto-start/etc.
How to enable large buffering (recommended!)
Hey - something is wrong with my stats?!
What the heck does RMSD stand for and what does it mean?
Further information on the publication of results and who owns the intellectual property generated by the project
During the CASP5 competition, will there be any indicators of structure prediction accuracy similar to the RMSD measurement we are accustomed to?
Why does Distributed Folding not predict every CASP5 candidate?
What is the fastest processor for running the DF client and why?
What part of the processor does DF stress the most?
What is more important for good performance when using the Distributed Folding client - frequency (clock speed, i.e. MHz/GHz), L2 cache size, Front Side Bus (FSB) bandwidth, memory settings (timings, etc.)?
How do the various major processors and platforms perform with the Distributed Folding client?
Does the size of the L2 cache on a processor have a large impact upon performance?
How much slower will each 'instance' of the Distributed Folding client in a dual-CPU computer run than a single 'instance' of the Distributed Folding client on a single processor?
What is faster, Windows or Linux?
What is the "-qt" or "quiet" mode, how do I use it, and why should I use it? [primarily applies to Windows 95/98/ME users]
Is support and optimization for high performance instruction sets, such as Altivec and SSE/SSE2, planned or likely?
Does anyone run the Distributed Folding client on a LARGE network of computers and, if so, how do they automate the installation, operation and auto-updating of so many computers?
On Linux, how do I get the Distributed Folding client to start automatically and to run in the background (invisibly)?
When an auto-update occurs on Windows and it restarts the client after download, does it call foldit.bat again to re-start or just jump up to the :start tag in the batch file so it is the same batch file continuing to run?
How do I safely stop the Distributed Folding client via script, a command in a cron job, etc?
How do I make it run twice as fast?!
What is Sneakernetting?
How do I sneakernet?
Important info regarding the protein change procedure.
For each protein we are working on, we need to generate a certain number of structures. You can follow the progress right here. Depending on how you have configured the client, it will upload after every xxxx structures your PC has generated (the number depends on the size of the protein). Every time it uploads, it also checks whether a new version (a new protein) has been released. If a new protein has been released, it will either ask if you want to download the new version (the default setting) or it will download it automatically. The important thing to know is that when we change to a new protein, the old structures become obsolete. If you are running the client on a PC without a net connection, it is therefore very important to watch the progress and upload your generated structures before the change, or you risk losing a lot of work.
Making filelist.txt for foldtrajlite (the Distributed Folding client)
If you run Distributed Folding (DF) with the -df (large buffer) switch, your client may cache a large number of *.val.bz2 and *.log.bz2 files if it is unable to upload them to the server for an extended period of time. The purpose of the filelist.txt file is to record these cached files, and to maintain a list of the files that need to be uploaded the next time the client is able to contact the server. While this seems all well and good, what may actually happen if your client has trouble uploading them due to server load, network conditions, etc, is that your filelist.txt may be deleted, or trimmed down to only include a few files. The client will then upload those few files and the rest will remain on your hard drive while the client continues producing and uploading new work. I suspect this is done to lighten the load on the server, but the result is: Lost work for you!
This small guide is aimed at providing an easy way to rebuild a filelist.txt file so that your client will upload those "orphaned" structure files. These directions are written for Windows NT/2K, but probably work on Windows 9x/ME as well.
Open a command prompt or MS-DOS window (start >> run >> cmd is my favorite way to get there).
Switch to the directory where your DF install is. cd \distribfold is where I keep mine.
Type dir *.bz2 /B /O-D > filelist.txt
Note: On Linux/Unix you can do it this way: ls -t *.bz2 > filelist.txt
You should now have a file in your DF directory named filelist.txt and it should contain a list of all of your "orphaned" work files. See notes below for an explanation of what all those switches in that command do.
Open filelist.txt in your favorite text editor.
Look for a line with rotlib.bin.bz2 on it. Delete that line.
Next, notice how most of the files are "paired". One .log.bz2 file, then one .val.bz2 file on the next line. Look all down through the list, to be sure that each two lines is a "pair". It's common to have extra .val files, usually near the end of the filelist.txt. If you have some extras, remove them, but be sure to leave the one directly below the last .log file!
That's pretty much it. You're ready to close and save your file, and begin an upload.
Important: Be sure to use the -df in the foldit.bat file. Without the -df, the client will not upload more than 6 pairs and if you have more than 6 pairs, the rest will be deleted!
When you have no more .log.bz2 files to upload, you have no more finished work. IF there are extra .val.bz2 files remaining, they are safe to delete. Good Luck!
Explanation of the dir command given above for those interested:
All of this info can be found in the help info by typing dir /?
*.bz2 - Obviously, we only want to list the .bz2 files.
/B - Cuts off the dates, file sizes, etc. Only lists filename.
/O-D - Orders the list by DATE. The "-" means reverse order. This is done because the .val are created before the .log, and they need to be the other way (.log then .val) in the filelist.txt.
> filelist.txt - Instead of output to the screen, we write the results of the dir command to the file.
For further info, have a look in this thread on the official project forum. In that thread, the procedure is described by Howard Feldman, who is one of the people behind the project.
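The manual steps above can also be sketched as a small script. This Python is only an illustration under assumptions: the function names keep_complete_pairs and rebuild_filelist are hypothetical, and the rule of dropping every entry that is not part of a .log/.val pair is one reading of the directions above, not documented client behaviour. Files with identical timestamps may not sort into pairs, so check the result by eye and back up your DF directory first.

```python
import os


def keep_complete_pairs(names):
    """Keep only '.log.bz2' entries directly followed by a '.val.bz2' entry.

    `names` must already be ordered newest first (like `dir /B /O-D`).
    Stray .val files and unpaired .log files are dropped.
    """
    kept = []
    i = 0
    while i < len(names):
        current = names[i]
        following = names[i + 1] if i + 1 < len(names) else ""
        if current.endswith(".log.bz2") and following.endswith(".val.bz2"):
            kept.extend([current, following])
            i += 2  # skip past the pair we just kept
        else:
            i += 1  # not the start of a pair: drop this entry
    return kept


def rebuild_filelist(df_dir):
    """Write filelist.txt in `df_dir` and return the names written."""
    names = [
        n for n in os.listdir(df_dir)
        if n.endswith(".bz2") and n != "rotlib.bin.bz2"  # rotlib is not a work file
    ]
    # Newest first, mirroring /O-D (reverse date order).
    names.sort(key=lambda n: os.path.getmtime(os.path.join(df_dir, n)), reverse=True)
    pairs = keep_complete_pairs(names)
    with open(os.path.join(df_dir, "filelist.txt"), "w") as handle:
        for name in pairs:
            handle.write(name + "\n")
    return pairs
```

Calling rebuild_filelist on your DF directory (for example the \distribfold folder mentioned above) would overwrite any existing filelist.txt, so only try it while the client is stopped.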
Addons: Distributed Folding GUI
A new but very cool little program written by Digital Parasite. It lets you monitor the client and other stuff. He is still working on adding several other interesting things, but it is already very nice.
You can find more and updated info and download it on this website.
Addons: Foldmonitor
A nice little monitor program (PerlTk script), that runs on both Linux and Windows. More info and download on the website.
Addons: KDFold
Another nice monitor program written in Kylix and therefore also available for both Linux and Windows. More info is available here.
Running Hidden/Win9x auto-start/etc.
The Knights Who Say Ni! has created a nice little tutorial and a little program for this, which you can find right here.
How to enable large buffering (recommended!)
Step 1:
Edit the "foldit.bat" batch file in the distribfold\ directory. You will see
something like this at the start of the file:
@echo off
:start
.\foldtrajlite -f protein -n native
The third line contains 'parameters' or 'arguments' (not sure what the proper term would be) for running the client .exe (foldtrajlite).
You want to add a "-df" 'switch' to this line so that it looks like this:
.\foldtrajlite -f protein -n native -df
This will enable much larger buffering, which will allow you to avoid the error message and halting of the client after 6 sets of structures are completed (30,000 proteins). The -df switch will allow your client to keep on crunching, even if it isn't able to contact the server to upload results. Also, if you have a larger number of structures to return you NEED the -df switch in order to properly upload the results (otherwise it can 'lose' some of them). With the -df switch enabled you don't have to worry about losing results like this.
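If you would rather not edit the batch file by hand, the change can be sketched in a few lines of Python. This is a hedged illustration only: add_switch is a hypothetical helper, it assumes the launch line begins with the foldtrajlite command as in the example above, and it normalizes line endings to "\n".

```python
def add_switch(batch_text, switch):
    """Append `switch` to every line that launches foldtrajlite, unless already present."""
    result = []
    for line in batch_text.splitlines():
        words = line.split()
        # Only touch lines whose first word is the client executable.
        if words and "foldtrajlite" in words[0] and switch not in words:
            line = line.rstrip() + " " + switch
        result.append(line)
    return "\n".join(result)


if __name__ == "__main__":
    example = "@echo off\n:start\n.\\foldtrajlite -f protein -n native"
    print(add_switch(example, "-df"))
```

The same helper would work for the other switches mentioned in this FAQ, such as -qt or -rt, and running it twice does not add the switch a second time.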
Hey - something is wrong with my stats?!
The only unique identifier in this project is your handle. The only person that knows your handle is you. The username you see in the stats is not unique. You are the only one that knows your handle because of security and privacy reasons (which is a good thing!). Since there is no way for the people who provide stats (like Dyyryath's cool stats) to identify you, they have to use your username combined with what team you are on and your organisation. If you change these (which you can do), the "stats engine" will not be able to identify you and your stats will appear to be wrong.
What the heck does RMSD stand for and what does it mean?
RMSD stands for Root Mean Square Deviation. It measures the difference between a result and a known result. In the case of the CASP4 proteins used to test the DF client in the early stages, this meant comparing the structure created by the DF client to the structure that was considered the 'winning' structure prediction for the given protein during the CASP4 trials. Basically, RMSD measures how "close" the structure of a protein, as predicted by the DF client, is to the closest known structure for that protein. A smaller RMSD is not an absolute measure of protein prediction quality, but it is a reasonably reliable method of judging accuracy. A rough guide is that anything under 6.0A RMSD is of use, anything under 4.0A RMSD is pretty good and anything under 1.5A RMSD is VERY good.
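To make the idea concrete, here is a minimal Python sketch of a root mean square deviation between two sets of atom positions: square the distance between each matched pair of atoms, average those squares, and take the square root. It assumes the two structures are already superimposed and that atoms are matched one-to-one by position in the list; how the DF client itself computes its RMSD is not described here. The function names and the label for values of 6.0 and above are made up for this example.

```python
import math


def rmsd(coords_a, coords_b):
    """Root mean square deviation between two equal-length lists of (x, y, z) points."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    total = 0.0
    for (ax, ay, az), (bx, by, bz) in zip(coords_a, coords_b):
        # Squared distance between the matched atoms.
        total += (ax - bx) ** 2 + (ay - by) ** 2 + (az - bz) ** 2
    return math.sqrt(total / len(coords_a))


def rough_quality(value):
    """Map an RMSD in angstroms onto the rough guide given in this FAQ."""
    if value < 1.5:
        return "very good"
    if value < 4.0:
        return "pretty good"
    if value < 6.0:
        return "of use"
    return "above the rough guide"
```

For instance, if every atom of a prediction sits exactly 5 angstroms away from its counterpart in the known structure, rmsd returns 5.0, which the rough guide would rate as "of use".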
Further information on the publication of results and who owns the intellectual property generated by the project:
I cannot tell you anything for certain, but from what I understand, any intellectual property generated by this project is the property of Mount Sinai Hospital. While there may possibly be some profit involved, the Hospital is itself a not-for-profit organization meaning that any proceeds would be reinvested into more research (to our projects and others in the institute)
Users will not be reimbursed in any way for their time of course (hence Intel's term 'philanthropic' computing), other than the personal satisfaction of having helped make great discoveries and helped science progress (and getting your name in the stats pages!).
Any discoveries made will be published in scientific journals, and may indeed be patented as well if appropriate (before publication as Jodie pointed out). But the act of publishing our findings to the scientific community effectively makes the knowledge we discover public domain.
Parts of our source code are open source, but other parts are not. We may possibly release parts of the screensaver and client code (minus the actual folding algorithm) in the future but for now those will remain closed source.
-Howard Feldman
During the CASP5 competition, will there be any indicators of structure prediction accuracy similar to the RMSD measurement we are accustomed to?
Finding an RMSD for proteins used in the CASP5 competition is not possible, because there is no known protein structure for these trial proteins with which to compare our predictions. However, one of the many methods used by the scientists at SLRI to evaluate the likely accuracy and quality of a protein structure prediction has been integrated into the scoring and results of the CASP5 protein predictions. This takes the form of an energy minimization algorithm. It is not as 'accurate' as RMSD and is NOT the only method being used to test protein structure prediction, but it can help to give a very general indicator of the quality and accuracy of a given structure prediction. With regard to energy minimization, a lower score indicates a greater likelihood of an accurate prediction.
Why does Distributed Folding not predict every CASP5 candidate?
The Distributed Folding project is targeted at a certain type and size of protein. Proteins that appear compatible with the DF methodology will be predicted using the DF client. Of the proteins which are not predicted using the DF client, some will still be predicted by the scientists at SLRI using other prediction methods (i.e. more traditional scientific methods).
What is the fastest processor for running the DF client and why?
MHz for MHz, Alpha is definitely THE fastest processor for distributed folding. The majority of the program's time, as I have mentioned before, is traversing pointers (specifically in binary-tree-like data structures). This accounts for 50% or more of its time. Another good but smaller chunk is spent RLE decompressing the data in protein.trj, the protein data file. The expanddb utility that originally came with foldtrajlite uncompressed protein.trj, but we found this made things slower, not faster, probably due to increased loading from disk.
-Howard Feldman
What part of the processor does DF stress the most?
The algorithm spends the majority of its time doing pointer traversal (following pointers down a tree-like structure). It is fastest on Alpha CPUs, which seem to perform this sort of operation best, but you are correct, it basically depends most on raw CPU power.
-Howard Feldman
What is more important for good performance when using the Distributed Folding client - frequency (clock speed, i.e. MHz/GHz), L2 cache size, Front Side Bus (FSB) bandwidth, memory settings (timings, etc.)?
According to user-submitted benchmarks, the Distributed Folding client does appear to rely primarily upon clock speed when comparing relatively new and well designed CPU architectures.
How do the various major processors and platforms perform with the Distributed Folding client?
The AMD K7 series of processors, also known as the Athlon series, perform very well with the Distributed Folding client. Processors based upon the K7 core perform very well on a clock per clock basis when compared with other processor architectures and also have absolute performance that appears to be matched or exceeded only by higher end Northwood core Pentium 4 CPUs. The K7 line of processors does scale in performance with the enhancements made to the processor core, such as with the "Thunderbird" and "Palomino" cores. Using DDR SDRAM with these processors provides a performance increase but is not required to get good performance out of the "Athlon" or "K7" processors. Any Athlon XP processor using DDR SDRAM on a quality chipset (ex. KT266a/KT333/SiS735) will provide very high performance on the Distributed Computing project and this appears to be the best price/performance value available.
The G3 and G4 appear to perform approximately equal to a Pentium III CPU at a given clock speed. However, the G3 and G4 processor lines do not scale to the high clock frequencies which processors based upon the Pentium 3 core can. Thus, absolute (or overall) performance on Macs is about the same as a mid-range Pentium 3 processor and cannot compete in terms of absolute performance with AMD Athlon CPUs based upon the "Thunderbird" (or later) core, nor can they compete with high end Pentium 3 processors or even the entry level Pentium 4 processors. If you have a Mac with a G3 or G4 processor and you are not running the distributed.net projects (RC5 & OGR) on it, then by all means run the Distributed Folding client on it. The client seems to run well (relative to other Distributed Computing projects) on Macs.
The Pentium 4 processor appears to perform quite well with the Distributed Folding client. Performance per clock trails AMD's K7 line of processors, but the difference in performance per clock is much smaller when running the Distributed Folding client than it is with most other Distributed Computing clients (GIMPS/Prime95 being the obvious exception). Since the Distributed Folding client craves clock speed over almost all other CPU resources, the Pentium 4 (especially of the Northwood variety and the overclocked variety) is arguably the absolute performance leader in Distributed Folding. Top of the line Pentium 4 systems and top of the line AMD AthlonXP systems leave behind all other competitors in terms of absolute performance. Both show large potential for performance increases as a result of overclocking. With the limited sample of benchmarks submitted so far for this project, it has not been possible to clearly determine which processor (the AthlonXP or the Pentium 4 Northwood) has the greatest absolute performance in Distributed Folding, but if the lead isn't already in the hands of the Pentium 4, it soon will be as the clock speeds of the Pentium 4 family continue to increase at a much more rapid pace than the Athlon line of CPUs can possibly hope to match.
AMD Duron processors have lower performance per clock cycle than their Athlon counterparts, but still perform quite well (they should still tend to outperform the majority of Pentium III CPUs). The difference in performance between the Duron and Athlon processors seems to come from the obvious places - the decreased L2 cache size, the lower clock speeds and the slower FSB.
The Pentium 3 processor performs quite well on a per clock basis in comparison to other processors (approx. the same as the G3/G4 and slightly less than the early Athlons). The Pentium 3, however, simply can not keep pace with the Athlon XP and the Pentium 4 Northwood in terms of FSB bandwidth and clock speed.
The Celeron takes a noticeable performance hit in comparison to the Pentium 3. It is unclear what amount of L2 cache is the minimum to avoid a performance hit, but the L2 cache of the PII based Celeron appears to be below that limit. Performance is still acceptable, but it clearly trails all of the K7 based processors and the Pentium 3 line of CPUs.
Pentium II CPUs are slower than the Pentium III CPUs, for obvious reasons, but the performance per clock does not appear to be significantly lower.
Does the size of the L2 cache on a processor have a large impact upon performance?
Using the limited sampling of benchmarks that have been collected so far, there DOES appear to be a sweet spot in terms of L2 cache size. The L2 cache of the Duron and the Celeron appear to be below this 'sweet spot' and the Pentium III, Athlon and Pentium 4 all appear to have a sufficient amount of cache (the size of the L2 cache on these processors varies, but all have an L2 cache size of at least 256k). The increased frequency of the L2 cache that came with moving the L2 cache onto the processor die (with the P3 and Thunderbird processors for Intel and AMD, respectively) appears to have improved performance, by a significant but unknown amount.
L2 cache size beyond 256k does not appear to impact performance in any significant fashion, if at all. The reason for this is unknown, but since the DF client is "probabilistically sampling conformational space" or doing a "kinetic 3 dimensional walk" or some other such scientific mumbo-jumbo (essentially, it has a pseudo-random progression through 'tree-like' data-sets and appears to re-use very little in the way of instructions or data) rather than doing extensive computation upon a (relatively) small set of data, such as an FFT (which is used in both SETI and Prime95 I believe), there is no performance benefit from being able to fit the entire data set (in the two previously mentioned programs, the FFT specifically) in the L2 cache. So, your large cache Xeons aren't going to outperform your similarly clocked Pentium 3s by much, if at all, when running the Distributed Folding client.
How much slower will each 'instance' of the Distributed Folding client in a dual-CPU computer run than a single 'instance' of the Distributed Folding client on a single processor?
Distributed Folding appears to have little performance penalty for running a second instance of the client on a dual CPU system. The performance delta between CPU0 and CPU1 is small (approx. 5%), and the performance delta between CPU0 and an identical CPU running in a single CPU system with a single instance of the client is also small.
What is faster, Windows or Linux?
The Distributed Folding client appears to perform significantly faster on Linux as opposed to Windows based systems. WindowsNT appears to be a bit slower than Windows9x based operating systems (when the client is tweaked for best performance). However, Windows9x systems with performance tweaked clients (specifically the usage of the '-qt' option) appear to approach the performance levels of Linux.
What is the "-qt" or "quiet" mode, how do I use it, and why should I use it? [primarily applies to Windows 95/98/ME users]
If using Win9x based operating systems with the Distributed Folding client and seeking optimum performance, it is highly recommended that you run the foldtrajlite.exe client (the DF application) with the '-qt' switch enabled. The '-qt' switch makes the client run in "quiet mode," which disables all output to the console. "Quiet Mode" applies only to the Command Line Client, as the Screensaver client and the WindowsNT service install do not open console windows. On Windows NT based operating systems and on Linux operating systems, running in "quiet mode" only increases performance by a small amount. I attribute the performance hit in Win9x to some sort of bug or some way in which the operating system poorly handles the calls made by the client. The amount of processing power required to display the information in the console window of the Distributed Folding client is very small and is completely out of proportion to the performance penalty witnessed with the client running on a Win9x OS. However, this is merely speculation as to the cause of the poor performance in Win9x.
The performance of the DF client with the console window displaying full information on all other operating systems is very likely to be the same as on Windows NT and Linux - only slightly slower than running the client in "quiet mode."
Note: if you do not know how to modify the foldit.bat file to configure the way in which the DF client operates, I strongly recommend that you use Jeff Gilchrist's dfGUI which can be found at dfGUI and allows you to easily select the configuration options for the client, as well as many other useful features for monitoring the performance and progress of the DF client.
Is support and optimization for high performance instruction sets, such as Altivec and SSE/SSE2, planned or likely?
Because the client spends most of its time in 'tree-like' pointer traversals, programming in support for special instruction sets like Altivec and SSE/SSE2 has little benefit and isn't likely. However, the client code has already been heavily optimized by the DF team.
Does anyone run the Distributed Folding client on a LARGE network of computers and, if so, how do they automate the installation, operation and auto-updating of so many computers?
[the following is blatantly cut and pasted from a thread on the forum of The Knights Who Say Ni!, where KWSN-MilleniumGuy2001 made the following post in an attempt to share the method by which he maintained his very large network of computers running the Distributed Folding client - your mileage may vary, this is meant only as a general guide to point you in the right direction]
For Distributed Folding, I did visit each workstation to set up a special DistributedComputing fileshare.
In some deep, hidden directory I created a directory called DP. Then I share that directory as DP$ to keep the share hidden from everybody who does a casual browse of the network. Under that directory I create another folder for the DistributedFolding files, let's call it DF. You will note that by creating the share one level up, it is easy to add another directory for F@H, G@H, Seti or whatever, and all of these projects can make use of the same fileshare.
Next I copy all of the files (from a network share) and run the "foldtrajlite.exe /install" command. I have a batch file that does these two steps. Once the program is installed as a service you never need to visit the workstation again.
I manage the service and change it from automatic start to manual start. I think that if a user decides that there is something wrong with their machine and they reboot it during the day then the service shouldn't launch itself again, just in case it's the extra service that is creating their problems.
Off the Windows 2000 resource kit there is a file called "netsvc.exe" which allows you to remotely manage all services. The commands are quite simple:
netsvc /start ComputerName "Distributed Folding Project Service"
and
netsvc /stop ComputerName "Distributed Folding Project Service"
Howard's service handles the file save correctly if you shut down the service this way. Alternatively, to shut down the service you can safely run this command:
del \\ComputerName\DP$\DF\foldtrajlite.lock
With some simple scripts you can do things like:
copy \\ComputerName\DP$\DF\progress.txt c:\DP\ComputerName.log
which will show you the last time the progress.txt file was updated and if you open it will show which structure that machine is working on.
I created several scripts to turn on every machine, turn off every machine, get the progress.txt files, delete the project files, and push out the new version of the project files.
KWSN-MilleniumGuy2001
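As a rough illustration of the "simple scripts" mentioned in the post above, the following Python only builds the command lines for a list of machines; it does not run anything. The function names and the computer names are hypothetical, while the service name, the DP$\DF share layout and the c:\DP target folder are taken from the post.

```python
SERVICE_NAME = "Distributed Folding Project Service"  # service name as quoted in the post


def start_command(computer):
    """netsvc command line that starts the DF service on `computer`."""
    return 'netsvc /start {0} "{1}"'.format(computer, SERVICE_NAME)


def stop_command(computer):
    """netsvc command line that stops the DF service on `computer`."""
    return 'netsvc /stop {0} "{1}"'.format(computer, SERVICE_NAME)


def fetch_progress_command(computer):
    """copy command that pulls progress.txt into c:\\DP as <computer>.log."""
    return r"copy \\{name}\DP$\DF\progress.txt c:\DP\{name}.log".format(name=computer)


def commands_for(computers, make_command):
    """Build one command line per computer name."""
    return [make_command(name) for name in computers]


if __name__ == "__main__":
    # Hypothetical machine names, for illustration only.
    for line in commands_for(["PC01", "PC02"], fetch_progress_command):
        print(line)
```

Writing the generated lines into a .bat file gives one batch file per task (start all, stop all, collect progress), which is roughly the arrangement the post describes.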
On Linux, how do I get the Distributed Folding client to start automatically and to run in the background (invisibly)?
[the original question, for reference:]
I'm having a bit of trouble getting 'foldit' to start and run in the background. I tried starting it from rc.local, but that prevented me from logging into the box ...ever again. The only way to connect to this machine is via SSH from my other linux box. Currently, I just fire it off from my main Linux box, and leave the window minimized. Is there a better way to accomplish this? The box in question is running RH 7.2.
I have my distributed folding client in a directory off my home and I add this to my rc.local:
/home/jeff/distribfold/foldit &
But I modified foldit in a very important way:
#!/bin/sh
cd /home/jeff/distribfold
nohup ./foldtrajlite -f protein -n native -df -qt > /dev/null
I use -df so that it will buffer proteins in case it can't contact the server, the -qt is very important so that it won't output information and fill up your nohup or other log files.
With that your client will start automatically when you boot your machine or if you have to start it manually you can just go to your distributed folding directory and run: foldit &
When an auto-update occurs on Windows and it restarts the client after download, does it call foldit.bat again to re-start or just jump up to the :start tag in the batch file so it is the same batch file continuing to run?
NT/2000/XP re-calls foldit.bat from another .bat file, while the others jump back to the :start tag with goto.
How do I safely stop the Distributed Folding client via script, a command in a cron job, etc?
Deleting the "foldtrajlite.lock" file will halt the Distributed Folding client safely after it finishes the current structure (it will try to upload the results to the server if you are online, as well). While this seems counter-intuitive (deleting a file to stop the client safely) it DOES work. The "foldtrajlite.lock" file is created when the DF client is executed and deleted when the client is properly shut-down. The existence of the "foldtrajlite.lock" file indicates that the DF client is either a) operating properly or b) was closed down improperly. Including a command to delete the "foldtrajlite.lock" file with whatever commands or scripts that you use to launch the DF client is a safe and reliable way to make sure that the client properly starts, even if it was shut down improperly.
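For a cron job or other scheduled script, the lock-file approach described above could look roughly like this in Python. It is a sketch only: request_safe_stop is a hypothetical name, and the directory you pass in is whatever folder your client runs from.

```python
import os


def request_safe_stop(df_dir):
    """Delete foldtrajlite.lock in `df_dir`; return True if a lock file was removed."""
    lock_path = os.path.join(df_dir, "foldtrajlite.lock")
    try:
        os.remove(lock_path)
    except FileNotFoundError:
        return False  # nothing to delete: no lock file in this directory
    return True
```

A return value of True only means the lock file was deleted. As described above, the client then finishes its current structure before halting, so the script should not assume the client has already exited.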
How do I make it run twice as fast?!
The latest version of the client has an option that doubles the speed that the client runs at when enabled! The catch is that it requires a substantial amount of RAM to be free in order to take advantage of this feature. With some of the largest proteins that the client may run in the near future, the client may use up as much as 150MB of RAM with this feature enabled, although the current protein uses up far less. If team members are looking at a way of increasing their production, purchasing extra memory for computers that don't have enough to currently take advantage of this feature is one of the most cost efficient ways of doing so.
For those who feel they can take advantage of this feature, the first step is to download the latest version of the client and install it on the computer(s) you are adding the new feature to. You should then add the -rt tag to the foldit.bat file or its equivalent for different operating systems to enable the feature. For those who have not previously added tags to enable features for Distributed Folding and are running the Windows text client, detailed instructions are as follows:
Locate the folder in which you installed the new version of the distributed folding client and select the foldit.bat file. Right click on the file and select the edit option. You should see an open window with different lines of text. Locate the line with the text ".\foldtrajlite -f protein -n native" and add -rt to the end of that line. The line should now look like this:
.\foldtrajlite -f protein -n native -rt
You should now save the change to the file, exit the program, and start up the program and enjoy the speed boost!
On Windows, you can also use Jeff Gilchrist's Distributed Folding GUI to enable the option. All you have to do to enable the speed enhancement is click on the proper box and add a checkmark to it in the program's options menu.
What is Sneakernetting?
Sneakernetting is the term used to describe the action of moving results from a non-internet connected machine to an internet connected machine for upload.
How do I sneakernet?
There are two ways to sneakernet:
- If you have a large capacity removable disk fitted on both machines (e.g. a Zip drive or a CD burner) then you can simply copy the entire DF directory to the removable disk. Then copy it over to the upload machine and run the client using the upload only switch (-u t). Also, make sure that you do not remove the "use extra buffer" option (if you are using it), otherwise the client may "lose" some of the saved structures. It may be easier to manage the upload by using DFGUI so that you don't have to edit the foldit.bat file by hand.
- If you don't have a large capacity removable disk fitted on both machines then you could use floppy disks and only move the files needed to perform the upload. Set up a client folder on the upload machine with the correct handle.txt file and any other settings configured (e.g. proxy settings). On the sneakernet machine copy the filelist.txt file to a floppy disk. Open the filelist.txt file and copy each pair of files listed there to the floppy disk. Each pair consists of a "fold_pstc8yw3_XXXXXX_protein.log.bz2" file and a "pstc8yw3_protein_YYYYYY.val.bz2" file, where the X's and Y's are a number. Paste these files in the DF directory on the upload machine and run the client using the upload only switch (-u t). Also, make sure that you do not remove the "use extra buffer" option (if you are using it), otherwise the client may "lose" some of the saved structures. It may be easier to manage the upload by using DFGUI so that you don't have to edit the foldit.bat file by hand. It is possible to fit around 160,000 structures of the current protein (129AA) on one floppy disk.
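The second method (moving only the files named in filelist.txt) can be sketched in Python as follows. This is an illustration under assumptions: copy_listed_files is a hypothetical helper, and it assumes filelist.txt holds one file name per line, as produced by the dir /B command shown earlier in this FAQ.

```python
import os
import shutil


def copy_listed_files(source_dir, target_dir):
    """Copy filelist.txt and every file it names from source_dir to target_dir.

    Returns (copied, missing): names that were copied, and names that were
    listed but not found in source_dir.
    """
    list_path = os.path.join(source_dir, "filelist.txt")
    with open(list_path) as handle:
        listed = [line.strip() for line in handle if line.strip()]
    copied, missing = [], []
    for name in listed:
        source = os.path.join(source_dir, name)
        if os.path.isfile(source):
            shutil.copy2(source, os.path.join(target_dir, name))
            copied.append(name)
        else:
            missing.append(name)
    # The list itself travels with the work files.
    shutil.copy2(list_path, os.path.join(target_dir, "filelist.txt"))
    return copied, missing
```

Here source_dir would be the DF directory on the sneakernet machine and target_dir the floppy (or any other removable disk). A non-empty missing list is a hint that filelist.txt and the files on disk have drifted apart, which is worth checking before the upload.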
