PGATK File Formats¶
The ProteoGenomics Analysis ToolKit (PGATK) is based on standard proteomics file formats developed by HUPO-PSI and on standard genomics file formats. This section highlights, in about 10 minutes, the most important features of those file formats, how they are used in PGATK, and how you can contribute to their development.
Note
This help page only covers the fundamentals of each file format used in PGATK. For more details, we provide links to the original documentation of each file format.
BED Format¶
BED¶
The BED (Browser Extensible Data) format provides a flexible way to define the data lines that are displayed in an annotation track (see the UCSC BED definition). BED lines have three required fields and nine additional optional fields. The number of fields per line must be consistent throughout any single set of data in an annotation track. The order of the optional fields is binding: lower-numbered fields must always be populated if higher-numbered fields are used.
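The two rules above (at least three fields, and the same number of fields on every line) can be checked with a few lines of Python. This is a minimal sketch, not part of PGATK: the function name `check_bed_lines` is hypothetical, and it assumes tab-delimited input in which `track`, `browser` and `#` lines are headers to be skipped.

```python
def check_bed_lines(lines):
    """Return the shared column count, or raise ValueError on a malformed track.

    Hypothetical helper for illustration; it is not a PGATK function.
    """
    counts = set()
    for number, line in enumerate(lines, start=1):
        line = line.rstrip("\n")
        # Assumption: blank lines and track/browser/comment lines carry no features.
        if not line or line.startswith(("track", "browser", "#")):
            continue
        fields = line.split("\t")
        # chrom, chromStart and chromEnd are always required.
        if len(fields) < 3:
            raise ValueError(f"line {number}: expected at least 3 fields, got {len(fields)}")
        counts.add(len(fields))
    # Every feature line in one track must have the same number of fields.
    if len(counts) > 1:
        raise ValueError(f"inconsistent field counts: {sorted(counts)}")
    return counts.pop() if counts else 0


n_columns = check_bed_lines([
    "chr1\t11873\t14409\tVLADIMIR\n",
    "chr3\t1000\t5000\tK(Phospho)SR\n",
])
```

Because optional fields are positional, a consistent column count is enough to know which fields are present; a line cannot skip a lower-numbered field.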
Note
If your data set is BED-like, but it is very large (over 50MB) and you would like to keep it on your own server, you should use the bigBed data format.
The pedBed Fields and Properties supported by PGATK:
Field (the first three are required) | Description | Example |
---|---|---|
chrom | The name of the chromosome | chr3 |
chromStart | The starting position of the feature in the chromosome or scaffold | 1000 |
chromEnd | The ending position of the feature in the chromosome or scaffold | 5000 |
name | Defines the label of the BED line. PGATK uses the annotated peptide sequence | K(Phospho)SR |
score | A score between 0 and 1000. | 1000 |
strand | Defines the strand. Either “.” (no strand), “+” or “-”. | |
thickStart | The starting position at which the feature is drawn thickly. | |
thickEnd | The ending position at which the feature is drawn thickly | 5000 |
itemRgb | An RGB value that determines the display color of the BED line | (0,0,255) |
blockCount | The number of blocks (exons) in the BED line. | |
blockSizes | A comma-separated list of the block sizes. | |
blockStarts | A comma-separated list of block starts. | |
proteinAccession | Protein accession number | |
transcriptAccession | Transcript Accession | |
peptideSequence | Peptide Sequence with no PTMs added | |
proteinUniqueness | Peptide uniqueness (see the color code in the Color section) | |
transcriptUniqueness | Peptide uniqueness (see the color code in the Color section) | |
genomeReferenceVersion | Genome reference version number | |
psmScore | Best PSM score | |
fdr | False-discovery rate | |
modifications | Comma-separated list of post-translational modifications | |
peptideRepeatCount | Peptide Counting | |
datasetAccession | Dataset Identifier | |
uri | Uniform Resource Identifier | |
Hint
If a field has no content, it should be filled with a “.”
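As an illustration of this rule, the sketch below builds one tab-delimited BED line and writes “.” for every empty field. The helper name `format_bed_row` is hypothetical and not part of PGATK; the example values are taken from the table above.

```python
def format_bed_row(values):
    """Join field values with tabs, writing "." for any empty field.

    Hypothetical helper for illustration; it is not a PGATK function.
    """
    cells = []
    for value in values:
        # None and the empty string count as "no content"; 0 is a valid value.
        if value is None or value == "":
            cells.append(".")
        else:
            cells.append(str(value))
    return "\t".join(cells)


# chrom, chromStart, chromEnd, name, score, strand (strand left empty)
row = format_bed_row(["chr3", 1000, 5000, "K(Phospho)SR", 1000, None])
```

Here the empty strand becomes “.”, which is also the value the table lists for “no strand”.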
Note
BED input files (and input received from stdin) are tab-delimited. The following types of BED files are supported by PGATK:
- BED4: A BED file where each feature is described by chrom, start, end, and name. (e.g. chr1 11873 14409 VLADIMIR)
- BED6: A BED file where each feature is described by chrom, start, end, name, score, and strand. (e.g. chr1 11873 14409 VLADIMIR 0 +)
- BED12: A BED file where each feature is described by all twelve columns listed above. (Default option in all tools)
- BED12+11: A complete BED file that includes the standard BED fields and the additional PGATK fields listed above.
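Since the BED types above differ only in how many tab-delimited columns they carry, a line can be classified by counting its fields. The following is a rough sketch under stated assumptions: `bed_flavour` and `BED_FLAVOURS` are hypothetical names, and the count of 23 columns for BED12+11 is inferred from the name (12 + 11), not confirmed by the PGATK tools.

```python
# Column count -> BED type. The value 23 is an assumption derived from "12+11".
BED_FLAVOURS = {4: "BED4", 6: "BED6", 12: "BED12", 23: "BED12+11"}


def bed_flavour(line):
    """Return the BED type of one tab-delimited line, based on its column count.

    Hypothetical helper for illustration; it is not a PGATK function.
    """
    n_fields = len(line.rstrip("\n").split("\t"))
    if n_fields not in BED_FLAVOURS:
        raise ValueError(f"unsupported number of BED columns: {n_fields}")
    return BED_FLAVOURS[n_fields]


flavour = bed_flavour("chr1\t11873\t14409\tVLADIMIR\t0\t+")
```

The example line is the BED6 example from the list above, so `flavour` is `"BED6"`.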
Color¶
Uniqueness Colors:
Colour | Description |
---|---|
Peptide is unique to single gene AND single transcript | |
Peptide is unique to single gene BUT shared between multiple transcripts | |
Peptide is shared between multiple genes | |
Modified Peptides Colors:
The PTMBED format is like BED, but it also contains the location of the post-translational modifications on the genome. Thick parts of the peptide blocks indicate the position of a post-translational modification on a single amino acid (short thick block), while longer thick blocks span the first and last post-translational modifications and the residues in between. In PTMBED, the colour code indicates the type of modification.
Additional Files formats¶
Peptide Atlas Peptide List¶
PeptideAtlas periodically (monthly or yearly) releases a list of peptides that have been identified by MS/MS (see the list here). PGATK supports this list as input for some of its tools, such as pepgenome.
Column | Field | Description |
---|---|---|
1 | peptide_accession | Peptide Accession (PAp06389395) |
2 | peptide_sequence | Peptide Sequence |
3 | best_probability | Best Peptide Probability |
4 | n_observations | Spectral counting |
5+ | (other columns) | Not used by PGATK |
Hint
For our pipelines and tools, the order of the columns is important.
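Because the columns are identified by position, a reader for this list can simply index into each row. The sketch below is an illustration only: `PeptideAtlasEntry` and `read_peptide_list` are hypothetical names, the file is assumed to be tab-delimited, and any row whose third and fourth columns are not numeric (for example a header line) is assumed to be skippable. The probability and count in the sample row are made-up values.

```python
import csv
import io
from typing import NamedTuple


class PeptideAtlasEntry(NamedTuple):
    peptide_accession: str
    peptide_sequence: str
    best_probability: float
    n_observations: int


def read_peptide_list(handle, delimiter="\t"):
    """Yield the first four columns of each row; later columns are ignored.

    Hypothetical helper for illustration; it is not a PGATK function.
    """
    for row in csv.reader(handle, delimiter=delimiter):
        if len(row) < 4:
            continue
        try:
            probability = float(row[2])
            observations = int(row[3])
        except ValueError:
            # Assumption: non-numeric values here mean a header line.
            continue
        yield PeptideAtlasEntry(row[0], row[1], probability, observations)


# Sample row with hypothetical values in columns 3 to 5.
sample = io.StringIO("PAp06389395\tVLADIMIR\t0.99\t12\tunused\n")
entries = list(read_peptide_list(sample))
```

Reordering the columns in the input would silently put the wrong values into each field, which is why the column order matters.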
Note
A full pipeline to map the PeptideAtlas peptide evidence to genome coordinates can be found here.