Using a computational server

           



If this Red Dwarf quote strikingly reminds you of your computational biology skills, keep reading ;).


CAT
  Well, speaking personally, I hardly didn't get no formal education at all.
LISTER
  No kidding, professor...
CAT
  No, it's true, bud.  That's why, sometimes, I don't know stuff.  Like... well, practically everything.
KRYTEN
  Was this because you brought yourself up, sir?
CAT
  Right.  There was no one else around, so I had to teach myself.  And seeing as I didn't know anything to begin with, lessons were long and slow; especially on Thursdays when I had double nothing.


I. HOWTOs

How to create a user account?

Your administrator will most likely create an account for you. If you have sudo rights, you can create new users by typing: sudo adduser username. You can add a user to the sudoers list by running sudo visudo and inserting username ALL=(ALL) ALL at the end of the opened file.

How to connect to the server?

Linux and Mac OS users:
1) open terminal (Ctrl+Alt+T in Linux)
2) type ssh username@servername
3) type your password (confirm authorization by typing yes if needed)
4) you are now in your home directory /home/your_username

Windows users:
1) download, install, and use PuTTY [http://www.putty.org/]

How to change your temporary password after signing in?

1) type passwd your_username
2) type your new password and confirm

How to show running processes and available resources?

1) type htop (or top)
2) see this page to understand its output: http://www.deonsworld.co.za/2012/12/20/understanding-and-using-htop-monitor-system-resources/
3) press Q to quit

How to copy data between computers?

Linux and Mac OS users can use SCP (Secure copy)

Copying file to server:

scp SourceFile user@host:directory/TargetFile

Copying file from server:

scp user@host:directory/SourceFile TargetFile

Copying directory from server:

scp -r user@host:directory/SourceFolder TargetFolder

Copying directory to server:

scp -r SourceDirectory user@host:directory/


Windows users can use WinSCP [http://winscp.net/eng/index.php] or any other SCP client.

How to make/copy/move/rename directories/files, and move around in the file system?

Read any introduction into Linux. Software Carpentry tutorials [http://software-carpentry.org/lessons.html] are very useful.

To display manual/help for a particular command:

man commandname

To print your current working directory:

pwd

To list all files in your current folder:

ls

To go into your home directory:

cd ~

To go into a directory inside your current directory:

cd downdir

To change your current directory one level up:

cd ..

To go into a particular directory given by an absolute path:

cd /home/filip/data/

To make a directory:

mkdir directoryname

To remove a file:

rm filename

To remove an empty directory:

rmdir directoryname

To remove a directory with files in it (careful, this cannot be undone):

rm -rf dirname

To move or rename a file:

mv oldname newname

To copy a file:

cp filename newfilename
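
The commands above can be tried safely in a throwaway directory. This is only a minimal sketch: the names project, notes.txt and so on are made up for illustration, and it assumes mktemp is available (it is on most Linux systems).

```shell
# Work in a scratch directory so nothing important can be touched
scratch=$(mktemp -d)
cd "$scratch"

mkdir project                 # make a directory
cd project                    # go into it
pwd                           # print where we are
echo "hello" > notes.txt      # create a small text file
cp notes.txt backup.txt       # copy a file
mv backup.txt old_notes.txt   # rename (or move) a file
mkdir empty_dir
rmdir empty_dir               # remove an empty directory
ls                            # list what is left: notes.txt and old_notes.txt
cd ..                         # go one level up
listing=$(ls project)         # keep the listing for later inspection
rm -rf project                # remove the directory and everything in it
```

Once this feels comfortable, the same commands work on real data; only rm deserves a second look before you press Enter.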

How do permission and ownership rights work in Linux? Can other users open/copy (read permission) or modify (write permission) my data? How to make a script executable?

Read this article at http://linuxcommand.org/lts0070.php or any other article on Linux permissions. You only need to get familiar with two commands: chmod and chown.

By default, other users can open/copy all your files, but not modify them. If you have some folders/files which you do not want to be accessible by other users, type:

chmod -R 700 dirname

To change owner of a file or directory (helpful for sudoers):

sudo chown filip:filip filename
sudo chown -R filip:filip dirname

To make a script executable:

chmod u+x scriptname
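
To see what these permission changes actually do, here is a small sketch you can run in a scratch directory. The script hello.sh and the folder private_data are hypothetical names used only for this illustration.

```shell
scratch=$(mktemp -d)
script="$scratch/hello.sh"

# Write a two-line script; the first line tells Linux which interpreter to use
printf '#!/bin/sh\necho "hello from a script"\n' > "$script"

chmod u+x "$script"        # give the owner (u) the execute (x) permission
ls -l "$script"            # show the permission string of the script
"$script"                  # run the script by its path

mkdir "$scratch/private_data"
chmod -R 700 "$scratch/private_data"   # owner: read/write/execute; group and others: nothing
mode=$(ls -ld "$scratch/private_data" | cut -c1-10)
echo "$mode"               # the first ten characters of the ls -l line
```

In the permission string, the first character is the file type (d for directory), followed by three rwx triplets for the owner, the group, and everybody else.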

How to compile and run programs?

1) First check if the program is already available system-wide, type:

which binaryname
whereis binaryname

Or if you don’t know the binary name, you can try:

locate programname


2) If it's not installed, you can do it yourself
Usually, executable binaries are available for download. Download specific binaries for our system (called something like exe Linux 64bit), make them executable (see above), and you can run your program locally in your folder by typing:

./programname

Sometimes programs have to be compiled from source. The procedure differs a lot between programs, so read the README/INSTALL files and manuals. The general procedure is to type ./configure followed by make.

3) How to put your program into your $PATH environmental variable?
Executable binaries (“programs”) can be run from any folder as long as they sit in a directory listed in your PATH. This is a very useful feature. To see which directories are in your path, type: echo $PATH | tr ':' '\n'. There is a hidden .bashrc file in your home folder. While in your home, type nano .bashrc and add a line like this: export PATH=/home/yourname/programs/programname:$PATH at the end of the .bashrc file to include your program-containing folder(s) in your path. The change applies to new terminal sessions, or right away after typing source ~/.bashrc.
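
The PATH mechanism described above can be tried out without touching your .bashrc. The sketch below is only an illustration: mytool is a made-up stand-in program, and the temporary folder plays the role of /home/yourname/programs/programname.

```shell
# A stand-in for a folder of downloaded programs (hypothetical location)
programs=$(mktemp -d)

# A tiny stand-in "program" called mytool
printf '#!/bin/sh\necho "mytool is running"\n' > "$programs/mytool"
chmod u+x "$programs/mytool"

# Put the folder at the front of PATH for this shell session only;
# the same export line in ~/.bashrc applies it to every new session
export PATH="$programs:$PATH"

resolved=$(command -v mytool)   # the full path the shell found for mytool
echo "$resolved"
result=$(mytool)                # runs from any directory, no ./ needed
echo "$result"
```

command -v is a handy complement to which: it reports exactly what your current shell would run for a given name.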

Where to put binaries/symlinks to make programs available to all users?

There are many ways to make 3rd party programs available to all server users. To avoid widely-used programs being installed over and over by many users in their home directories, sudo users can compile/move binaries into e.g. the /opt/src directory and symlink them into the /opt/bin directory (which has to be added to every user’s PATH).

How to use Screen to keep your process running after you close your terminal window?

1) before starting your analysis, type screen and confirm by pressing Space (or type screen -S analysisname to name your screen, then you can reattach using just its name)
2) start your analysis
3) press Ctrl+A, then D, to detach from your screen
4) now you can log out from the server and your process will keep running
5) to reattach to the running screen process, type screen -ls to see running screens
6) type screen -r name_of_the_screen_to_reattach

There are other ways to do the same thing (& and disown), but they are much less convenient than the screen command.

Where can I get more information on computational biology?

    1) There are many workshops where you can get hands on experience, see http://evomics.org/.
    2) If you're looking for books, get Practical Computing for Biologists [http://practicalcomputing.org/] or similar books from O'Reilly [http://shop.oreilly.com/category/browse-subjects/science-math/bioinformatics.do].
    3) Search google and internet forums for your questions [http://seqanswers.com/, http://www.biostars.org/, http://stackoverflow.com/].
    4) If you know someone experienced, bug him/her with questions and pick his/her brain.

Do I really need to learn at least one programming language for genomics?

Yes, you do, and that goes for practically all of biology, but try telling this to biologists... ;). You will probably need more than one language. I'd suggest starting with Bash (you're already using parts of it and it's pretty simple) and then moving on to Python [these two books are pretty awesome: http://pythonforbiologists.com/books/index.html], but you can do the same with Perl or R, and I'm pretty sure you'll meet these languages anyway along your learning curve.

How many threads can I use for my analysis (to be not considered selfish)?

Talk to other users! An unspoken rule in many labs is to use less than ⅔ of all threads. Feel free to use as many threads as needed during weekends and holidays (or if you see that nobody's using the server, e.g. overnight), but always leave one or two threads for others to use for simple tasks. If you want to use more processors, change the process priority (nice/renice commands) to be lower than the default, so that your job basically acts as a transient process using resources only when available. This is what I usually do with most of my processes.
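
As a rough sketch of the advice above, the snippet below works out a polite thread count and shows what a lowered priority looks like. It assumes nproc and nice from GNU coreutils; the two-thirds figure is just the rule of thumb mentioned here, not a hard limit.

```shell
# Count the hardware threads the server exposes
total_threads=$(nproc)

# Roughly two thirds of them, rounded down, but never less than one
polite_threads=$(( total_threads * 2 / 3 ))
if [ "$polite_threads" -lt 1 ]; then
    polite_threads=1
fi
echo "Use at most $polite_threads of $total_threads threads"

# Start a command at the lowest priority (niceness 19).
# The command started here is nice itself, which, when called
# without arguments, prints the niceness it is running at.
low_priority=$(nice -n 19 nice)
echo "niceness of the low-priority job: $low_priority"

# For a job that is already running, renice does the same by process ID:
#   renice -n 19 -p PID
```

In real use you would replace the inner nice with your own command, e.g. nice -n 19 yourprogram followed by its options.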

How can I compress and decompress files?

To tar and compress a file using tar and gzip or bzip2:

tar -zcvf futurefilename.tar.gz filetocompress
tar -jcvf futurefilename.tar.bz2 filetocompress


To untar or decompress a file that was created using tar:

tar -zxvf filename.tar.gz
tar -jxvf data.tar.bz2


To compress a file by gzip or bzip2:

gzip filename
bzip2 filename


To decompress a .gz or .bz2 file:

gunzip filename.gz
bunzip2 filename.bz2
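
Here is a small round trip you can run to convince yourself that nothing gets lost. It is a sketch on toy data: the reads folder and sample.fasta are invented for the example.

```shell
scratch=$(mktemp -d)
cd "$scratch"

# Toy data: a directory with one small FASTA file
mkdir reads
printf '>seq1\nACGT\n>seq2\nGGCC\n' > reads/sample.fasta

tar -zcf reads.tar.gz reads          # pack and gzip the whole directory
tar -ztf reads.tar.gz                # list the archive contents without unpacking

mkdir restored
tar -zxf reads.tar.gz -C restored    # unpack into another directory

gzip reads/sample.fasta              # replaces the file with reads/sample.fasta.gz

# Read the compressed file without unpacking it on disk
seqs=$(gunzip -c reads/sample.fasta.gz | grep -c '>')
echo "$seqs sequences"
```

Note that gzip and bzip2 replace the original file with the compressed one, while tar leaves the originals in place.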

Which blast or hmmer databases are available and how to add a new one for all users?

Have a look (with the echo command) at the $BLASTDB and $HMMERDB environment variables. If they're set, you can use database names (such as nr or nt) for your blast searches without specifying the absolute path to these databases, and blast and hmmer should be able to find the particular database files. If you want to add a large database to this folder or if you need to update some of the databases, contact your admin. Often there is really no reason to blast against huge and poorly annotated databases such as nr and nt; try to use RefSeq, SwissProt or other properly curated alternatives as much as possible.

How can I get easily parsable tabular blast output including species names?

Use the user-specified tabular (or XML) output format with sscinames in it. Read the BLAST manual for more info: http://www.ncbi.nlm.nih.gov/books/NBK1763/#CmdLineAppsManual.Quick_start. The NCBI taxdb files have to be in the directory that your/our $BLASTDB environment variable points to.

-outfmt '6 qseqid sseqid evalue bitscore sgi sacc staxids sscinames scomnames stitle'
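
Once you have such a table, standard tools can summarize it. The sketch below assumes the exact column order given in the -outfmt string above (so sscinames is column 8 and bitscore is column 4). The three hits are invented placeholders, not real BLAST output.

```shell
hits=$(mktemp)

# Three invented, tab-separated hits in the column order of the -outfmt string
printf 'query1\tsubj1\t1e-50\t200\t0\tTOY001\t1\tSpecies alpha\talpha\ttoy protein\n'  > "$hits"
printf 'query2\tsubj2\t1e-20\t90\t0\tTOY002\t2\tSpecies beta\tbeta\ttoy protein\n'    >> "$hits"
printf 'query3\tsubj3\t1e-35\t150\t0\tTOY003\t1\tSpecies alpha\talpha\ttoy protein\n' >> "$hits"

# Column 8 is sscinames: count hits per species, most frequent first
cut -f8 "$hits" | sort | uniq -c | sort -k1,1nr

# Keep only the name of the most frequent species
top_species=$(cut -f8 "$hits" | sort | uniq -c | sort -k1,1nr | head -n 1 | sed 's/^ *[0-9]* *//')
echo "$top_species"

# Print query, species and bit score for hits with a bit score of at least 100
strong=$(awk -F'\t' '$4 >= 100 {print $1, $8, $4}' "$hits")
echo "$strong"
```

If you change the -outfmt string, the column numbers in cut and awk have to change with it.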


My analysis interferes with another analysis currently running (e.g. for RAM). Is there a way to pause this analysis, release its used memory and restart it when the other analysis is finished?

If you are running your analysis inside a screen, you can reattach to it, pause it with Ctrl+Z, and then detach. To restart it, reattach to the screen again, resume your analysis by typing fg, and then detach your screen. This should work and eventually release memory in most cases. If not, some programs save checkpoints, so you can kill the job and then restart from the last saved checkpoint.

I would like to use a program with graphical user interface (IGV, IGB, Artemis, PathwayTools, ...), can I use the server for it?

Yes, you can, but I cannot guarantee that it will be fast enough for serious work, because it can be painfully slow. The only way to find out is to try it and see if it limits you in any way. Since these programs are usually really easy to install and not memory/CPU demanding, why not just use your laptop?

How can I install Perl modules?

If you have sudo rights, using CPAN is extremely easy. Simply type:

sudo cpan 

Then specify which module you need to install, e.g.:

install Getopt::Long


There are many ways for a non-sudo user to install modules just for themselves.
The simplest solution is to keep the modules in a folder of your own and to set the $PERL5LIB environment variable to that folder at the end of your .bashrc file like this:

cd ~
echo 'export PERL5LIB=/home/yourname/my_perl_modules' >> .bashrc

Then doublecheck that it got set by printing its content:

echo $PERL5LIB

How can I install R packages and particularly Bioconductor modules?

If you have sudo rights, type the commands below.

To install R packages:

sudo R
install.packages("modulename")

To install Bioconductor modules:

sudo R
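source("http://bioconductor.org/biocLite.R")
biocLite("modulename")

Note that biocLite has been retired in recent Bioconductor releases (3.8 and later). There, use BiocManager instead:

install.packages("BiocManager")
BiocManager::install("modulename")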

How can I install Python modules?

If you have sudo rights, type one of the two following commands:

sudo pip install modulename
sudo easy_install modulename

Which Java/Perl/R/Ruby/Python/PHP/SQL version is installed?

Type:

java -version
perl -version
R --version
ruby --version
python --version
python3 --version
php --version
mysql --version
psql --version

How can I switch between java from Oracle and OpenJDK?

To switch between the available Java versions (sudoers only), type the command below.

sudo update-alternatives --config java

How to keep my folder from expanding into many TBs of data?

BAM and SAM files are usually enormous: keep only one of them, use --no-unal in bowtie2, pipe between samtools commands, and look for other ways to keep disk usage low. Do not keep copying huge raw files out of data folders; give programs the paths to the originals when you use them for assemblies or mapping. If you trim and error-correct your files prior to assemblies, do it only once and keep the corrected files! If you do not use a file, compress it, especially if it's a fastq or sam file. Many programs can read gzipped files directly.

To get human readable info for files/directories in your current folder, type:

du -sh *

To find all your files bigger than 10 GB in your home folder, type:

find ~ -size +10G
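
The same two commands can be tried on a small scale first. In this sketch the file names are made up and the size threshold is 1 MiB instead of 10 GB, so that it runs in a second.

```shell
scratch=$(mktemp -d)

# A 3 MiB file of zeros and a tiny text file (toy data)
head -c 3145728 /dev/zero > "$scratch/big.bin"
echo "small" > "$scratch/small.txt"

du -sh "$scratch"/*                               # human-readable size of each entry

# Only regular files larger than 1 MiB
big_files=$(find "$scratch" -type f -size +1M)
echo "$big_files"

gzip "$scratch/big.bin"                           # compress what you do not use right now
du -sh "$scratch"/*                               # big.bin.gz is far smaller than big.bin was
```

Real sequence data will not shrink as dramatically as a file full of zeros, but fastq and sam files still compress very well.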

My text files (fasta, phylip, nexus, ...) work when using my Windows (or an old Mac) machine, but they don't work when uploaded to the Linux server. What's wrong?

The characters used to mark line breaks in text files differ between operating systems, and most programs cannot deal with the difference [http://en.wikipedia.org/wiki/Newline].
Windows systems use a combination of a carriage return (CR) and a line feed (LF), mostly for historic printer-compatibility reasons.
All Unix systems use a line feed (LF) only. Old Macs used a carriage return (CR) only, but newer Macs use the same line break (\n) as Linux. Just in case you have some old text files from Macs, the mac2unix utility is also installed.

To figure out origin of your text file, type:

file filename

To convert these line ends, type one of the commands below (it's pretty self-explanatory):

dos2unix filename
unix2dos filename
mac2unix filename
unix2mac filename
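
If you want to see the problem with your own eyes, the sketch below builds a tiny file with Windows line ends and cleans it up. It uses tr as a stand-in for dos2unix (in case dos2unix is not installed where you try this); windows.fasta and unix.fasta are made-up names.

```shell
scratch=$(mktemp -d)

# A two-line toy FASTA file with Windows (CRLF) line ends
printf '>seq1\r\nACGT\r\n' > "$scratch/windows.fasta"

cr=$(printf '\r')                                    # a literal carriage return character
with_cr=$(grep -c "$cr" "$scratch/windows.fasta")    # number of lines that contain a CR
echo "lines with CR before: $with_cr"

# Delete every carriage return and write the result to a new file
tr -d '\r' < "$scratch/windows.fasta" > "$scratch/unix.fasta"

remaining=$(grep -c "$cr" "$scratch/unix.fasta")
echo "lines with CR after: $remaining"
```

Unlike dos2unix, tr does not change the file in place, so the original stays untouched until you are happy with the result.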

II. MISC. TIPS AND TRICKS

Regular expressions
- this is in my opinion one of the most important things to learn for computing in biology
- just google "regex cheat sheet", there are tons of tutorials and cheat sheets available
- if you often need to extract and modify text strings in huge files (and excel is slow or runs out of memory), these expressions can do the same thing and are really snappy
- once you manage the basic ones, you can use them in grep, sed, awk, perl, python, ... you name it
- be careful, though, and always test them properly with toy data sets as they can get pretty funky and idiosyncratic (not only for newbies...)
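
To give a taste of what regular expressions look like in practice, here is a sketch using grep and sed on three invented FASTA headers. The header layout (name|species|len=N) is an assumption made for this example only.

```shell
headers=$(mktemp)
printf '>gene1|Homo_sapiens|len=300\n>gene2|Mus_musculus|len=120\n>gene3|Homo_sapiens|len=95\n' > "$headers"

# grep -E: keep headers whose length has three or more digits
long=$(grep -E 'len=[0-9]{3,}$' "$headers")
echo "$long"

# grep -c: count the headers that mention a given species
human=$(grep -c 'Homo_sapiens' "$headers")
echo "$human human genes"

# sed -E: capture gene, species and length, then print them in a new order
reordered=$(sed -E 's/^>([^|]+)\|([^|]+)\|len=([0-9]+)$/\2 \1 \3/' "$headers")
echo "$reordered"
```

The parentheses capture parts of the line, and \1, \2 and \3 paste them back in the replacement. This is the trick that replaces most copy-and-paste work in a spreadsheet.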

Keyboard shortcuts
Ctrl+R -- to search in your bash history (all your previous commands)
Ctrl+C -- to kill the process running in your terminal
Ctrl+D -- to signal end of input (this closes interactive programs and, at an empty prompt, logs you out)
Tab -- to autocomplete commands/directories
Up and down arrows -- to show recently used commands

To open the vim editor and start practicing, type:

vim testfile.txt 

To open very simple command line text editor:

nano 

To print a text file:

cat filename 

To print a text file so that you can scroll down:

less filename

To download a file from an internet address:

wget URLaddress

To search for a file by name in the current directory and everything below it:

find . -name filename

To print the first 10 lines of a file:

head filename

To print the last 10 lines of a file:

tail filename

To join lines of two files on a common field:

join filename1 filename2

To split a file into pieces:

split

To remove sections from each line of files:

cut

To merge lines of files:

paste

To sort lines of text files:

sort

To translate or delete characters:

tr

To report or omit repeated lines:

uniq

To transfer a URL:

curl

To browse the World Wide Web from the command line:

lynx

To display a line of text:

echo

To format and print data:

printf

To print lines/words matching a pattern:

grep

To filter and transform text:

sed

To read from standard input and write to standard output and files:

tee
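
Most of these small tools become really useful once you chain them with pipes. The sketch below strings a few of them together on two made-up tables (samples.tsv and counts.tsv are invented, as are all values in them).

```shell
scratch=$(mktemp -d)
cd "$scratch"

# Toy tables, tab-separated: sample and tissue, sample and read count
printf 's1\tliver\ns2\tbrain\ns3\tliver\ns4\tliver\n' > samples.tsv
printf 's1\t10\ns2\t20\ns3\t30\ns4\t40\n' > counts.tsv

head -n 2 samples.tsv                       # first two lines only
cut -f2 samples.tsv | sort | uniq -c        # how many samples per tissue

# Unique tissues, upper-cased and merged into one comma-separated line
tissues=$(cut -f2 samples.tsv | sort -u | tr 'a-z' 'A-Z' | paste -sd, -)
echo "$tissues"

# Join both tables on the first field; tee shows the result and saves it too
join samples.tsv counts.tsv | tee joined.txt

liver_count=$(grep -c 'liver' joined.txt)   # number of lines matching a pattern
echo "$liver_count liver samples"

sed 's/liver/hepatic/' joined.txt           # replace text on the fly, file stays unchanged
```

Keep in mind that join expects both files to be sorted on the field they are joined on, which is the case for these toy tables.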