Comma separated value (.csv); FASTA (.fas) and Genbank (.gb) sequence files
There are three types of file formats that j5 uses for input or output: comma separated value, FASTA and Genbank.

Comma separated value (CSV) files

CSV files are plain text files that can easily be arranged into a spreadsheet-like array. Each line of text in a CSV file represents a row, and the comma (,) character delimits the columns within a line. Because commas delimit columns, a field that itself contains a comma must be enclosed in double quotation marks (e.g. "Last name, First name" instead of Last name, First name). Because the double quote (") character is used to enclose a field containing special characters, a double quote that appears within a field must itself be doubled, and the entire field must be enclosed in double quotation marks (e.g. "Hannibal ""the Cannibal"" Lecter" instead of Hannibal "the Cannibal" Lecter). An empty or undefined field is represented by two commas immediately following one another, or by empty double quotes (e.g. ,, or ,"",).

CSV files can be viewed and edited in any text editor, and can also be readily opened, viewed, and edited with spreadsheet software such as Microsoft Excel or OpenOffice. Depending on how your particular spreadsheet software opens a CSV file by default, you may need to adjust the displayed column widths to see the entire contents of a given cell, or to see several columns on the screen at the same time.
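The quoting rules above can be checked with a short script. The following is a minimal sketch using Python's standard csv module, not the parser that j5 itself uses; the column names and row contents are made up for illustration.

```python
import csv
import io

# Hypothetical rows: one field contains a comma, one contains double quotes,
# and one is empty.
rows = [
    ["Name", "Note"],
    ["Last name, First name", 'Hannibal "the Cannibal" Lecter'],
    ["", "empty first field"],
]

# Write the rows to an in-memory CSV "file". The writer quotes only the
# fields that need it and doubles any embedded double quotes.
buffer = io.StringIO()
csv.writer(buffer, lineterminator="\n").writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)

# Reading the text back recovers the original fields unchanged.
parsed = list(csv.reader(io.StringIO(csv_text)))
print(parsed == rows)
```

Writing a file with a CSV library and reading it back, as done here, is a quick way to confirm that fields containing commas or double quotes survive a round trip.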

For more information, see:
http://search.cpan.org/~hmbrand/Text-CSV_XS-0.72/CSV_XS.pm

FASTA (aka Pearson) format sequence files

A FASTA file is a text file that may contain one or more sequences. FASTA files do not contain sequence annotation. For the purposes of j5, and for maintaining well-documented sequences in general, the Genbank file format (see below) or the jbei-seq format is much preferred. j5 will auto-annotate each FASTA sequence with a "misc_feature" type feature whose label matches the display ID of the sequence (i.e. what follows the ">").

Note that there is no standard file extension for FASTA files; j5 uses .fas.
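To make the structure of a FASTA file concrete, here is a minimal sketch of a reader in Python. It is not j5's own parser. It assumes that the display ID is the first whitespace-delimited word after the ">" and that any remaining text on that line is a free-form description; the sequences and names are hypothetical.

```python
SAMPLE_FASTA = """\
>pExample hypothetical plasmid fragment
ATGGCTAGCAAAGGAGAAGAA
CTTTTCACTGGAGTTGTCCCA
>partB
GGATCCAAACTCGAGTAAGGATCTCC
"""


def parse_fasta(text):
    """Return a list of (display_id, description, sequence) tuples."""
    records = []
    display_id, description, chunks = None, "", []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A new header line closes the previous record, if any.
            if display_id is not None:
                records.append((display_id, description, "".join(chunks)))
            header = line[1:].split(None, 1)
            display_id = header[0] if header else ""
            description = header[1] if len(header) > 1 else ""
            chunks = []
        elif display_id is not None:
            # Sequence lines may be wrapped; join them back together.
            chunks.append(line)
    if display_id is not None:
        records.append((display_id, description, "".join(chunks)))
    return records


records = parse_fasta(SAMPLE_FASTA)

# Illustrative stand-in for the auto-annotation described above: one
# "misc_feature" per sequence, labeled with the display ID.
features = [{"type": "misc_feature", "label": rec[0]} for rec in records]

for (display_id, description, sequence), feature in zip(records, features):
    print(display_id, len(sequence), feature)
```

Because a FASTA record carries nothing beyond its header line and sequence, the display ID is the only name available for labeling the sequence.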

For information about the FASTA format, see:
http://www.ncbi.nlm.nih.gov/blast/fasta.shtml
http://www.bioperl.org/wiki/FASTA_sequence_format

Genbank format sequence files

Genbank format sequence files generally contain a single sequence each (although at times you may see multiple translated ORF sequences embedded as feature annotations within a long DNA sequence). Feature annotations within Genbank format files are extremely useful for viewing a DNA sequence at a higher, more functional level, and they make it possible to quickly check whether a designed DNA assembly process will result in the desired sequence.

Currently, the "label=" field for a given feature annotation is used to determine if two features (with the same name/label) should be spliced at a DNA assembly junction.
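As an illustration of where the "label=" field lives, the following minimal sketch collects the label of each feature in a Genbank feature table and reports the labels shared by more than one feature. It is not j5's implementation, and it does not reproduce the rules j5 applies when deciding whether to splice features. The feature table, the function name feature_labels, and the regular expressions are assumptions made for this example; a full Genbank parser (such as the one in BioPerl) handles many cases that this sketch ignores.

```python
import re
from collections import defaultdict

# Hypothetical excerpt of a Genbank record (feature table only).
SAMPLE_FEATURES = """\
FEATURES             Location/Qualifiers
     promoter        1..35
                     /label=pExample_promoter
     CDS             36..755
                     /label="gfp"
                     /note="hypothetical example"
     CDS             complement(800..1200)
                     /label=gfp
ORIGIN
"""

# A feature line starts with five spaces, then the feature key and location.
FEATURE_LINE = re.compile(r"^ {5}(\S+)\s+(\S.*)$")
# A label qualifier may be written with or without double quotes.
LABEL_LINE = re.compile(r'^\s+/label="?([^"]*)"?\s*$')


def feature_labels(genbank_text):
    """Return (feature_type, location, label) for each labeled feature."""
    results = []
    current = None  # (type, location) of the feature being read
    in_features = False
    for line in genbank_text.splitlines():
        if line.startswith("FEATURES"):
            in_features = True
            continue
        if not in_features:
            continue
        if line and not line.startswith(" "):
            break  # next top-level section (e.g. ORIGIN)
        match = FEATURE_LINE.match(line)
        if match:
            current = (match.group(1), match.group(2).strip())
            continue
        match = LABEL_LINE.match(line)
        if match and current:
            results.append((current[0], current[1], match.group(1).strip()))
    return results


# Group features by label and keep the labels used more than once.
by_label = defaultdict(list)
for ftype, location, label in feature_labels(SAMPLE_FEATURES):
    by_label[label].append((ftype, location))
shared = {label: feats for label, feats in by_label.items() if len(feats) > 1}
print(shared)
```

Since matching is done on the label text, giving features consistent labels across sequence files matters when those features are meant to be treated as the same feature.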

For more information about the Genbank format, see:
http://www.bioperl.org/wiki/GenBank_sequence_format
http://www.ncbi.nlm.nih.gov/Sitemap/samplerecord.html