The Human Skeletal Muscle Proteome Project: a reappraisal of the current literature
- National Institute on Aging, National Institutes of Health, Baltimore, MD, USA
Richard D. Semba,
- Wilmer Eye Institute, Johns Hopkins University School of Medicine, Baltimore, MD, USA
- National Institute on Aging, National Institutes of Health, Baltimore, MD, USA
- National Institute on Aging, National Institutes of Health, Baltimore, MD, USA
- National Institute on Aging, National Institutes of Health, Baltimore, MD, USA
- Department of Endocrinology, Odense University Hospital, Odense, Denmark
- Institute of Clinical Research and Institute of Molecular Medicine, University of Southern Denmark, Odense, Denmark
- Thermo Fisher Scientific, West Palm Beach, FL, USA
- National Institute on Aging, National Institutes of Health, Baltimore, MD, USA
Luigi Ferrucci (corresponding author)
- National Institute on Aging, National Institutes of Health, Baltimore, MD, USA
Correspondence to: Luigi Ferrucci, National Institute on Aging, National Institutes of Health, Baltimore, MD, USA. Phone: 410-350-3936, Email: firstname.lastname@example.org
Skeletal muscle is a large organ that accounts for up to half the total mass of the human body. A progressive decline in muscle mass and strength occurs with ageing and, in some individuals, configures the syndrome of ‘sarcopenia’, a condition that impairs mobility, challenges autonomy, and is a risk factor for mortality. The mechanisms leading to sarcopenia as well as myopathies are still poorly understood. The Human Skeletal Muscle Proteome Project was initiated with the aim of characterizing muscle proteins and how they change with ageing and disease. We conducted an extensive review of the literature and analysed publicly available protein databases. A systematic search of peer-reviewed studies was performed using PubMed. Search terms included ‘human’, ‘skeletal muscle’, ‘proteome’, ‘proteomic(s)’, ‘mass spectrometry’, and ‘liquid chromatography-mass spectrometry (LC-MS/MS)’. A catalogue of 5431 non-redundant muscle proteins identified by mass spectrometry-based proteomics from 38 peer-reviewed scientific publications from 2002 to November 2015 was created. We also developed a nosology system for the classification of muscle proteins based on localization and function. Such an inventory of proteins should serve as a useful background reference for future research on changes in the muscle proteome, assessed by quantitative mass spectrometry-based proteomic approaches, that occur with ageing and diseases. This classification and compilation of the human skeletal muscle proteome can be used for the identification and quantification of proteins in skeletal muscle to discover new mechanisms for sarcopenia and specific muscle diseases that can be targeted for prevention and treatment.
Skeletal muscle is one of the most abundant tissues in mammals, accounting for up to 40% of the total mass of the human body. Human skeletal muscle is a complex organ made up of long, cylindrical, multinucleated cells called myofibres that are, in turn, composed of myofibrils. Myofibrils consist of proteins called myofilaments, mainly actins and myosins, that form the sarcomeres, which are the smallest contractile units. The contraction–relaxation cycle in muscle requires energy that is mostly generated aerobically by mitochondria, which are particularly abundant in adult muscle fibres. Contraction is stimulated by motor neurons that are resident in the spinal cord and connect with muscle fibres through a specialized synapse known as the neuromuscular junction. Acetylcholine released into the neuromuscular junction cleft interacts with specific receptors and generates an action potential across the surface of the muscle cell that rapidly spreads within the muscle fibres through a network of specialized tubules called T tubules. The action potential triggers the opening of the dihydropyridine receptor that allows influx of calcium (Ca2+) into the cytoplasm, which stimulates ryanodine receptors to release more Ca2+ from the sarcoplasmic reticulum. Ca2+, in turn, triggers the interaction between contractile proteins and elicits a shortening of the muscle fibre. Such structural and functional complexity is mirrored by the large repertoire of proteins found in skeletal muscle.
Maintenance of muscle mass and effective muscle function throughout life is essential to preserve a healthy and active lifespan and to maintain physical function and autonomy in old age. Conversely, the development of overt sarcopenia has devastating consequences for individuals' lives because it is associated with mobility loss, higher risk of disability, nursing home admission, and death.[3, 4] Thus, understanding the causal pathways leading to age-related sarcopenia and developing strategies for the prevention and cure of this condition are essential to improve the well-being and quality of life of a growing number of older individuals in the population. There is strong evidence that type 2 diabetes, cancer, chronic kidney diseases, chronic pulmonary disease, and bed rest cause accelerated loss of muscle mass and strength, which is superimposed on the effect of ageing. However, the mechanisms by which ageing and diseases cause sarcopenia are still not fully understood and may be quite different. With the exception of exercise, there is currently no effective treatment for sarcopenia.
Most current research on sarcopenia is based on clinical characterization, analysis of blood biomarkers, and examination of muscle biopsies.[5-8] Histological examination of muscle specimens is generally considered the gold standard test for most myopathies but, especially for age-related sarcopenia, histology lacks diagnostic specificity and has limited value for monitoring the effect of interventions.
We propose that proteomic analysis of muscle tissue, especially using a discovery approach complemented by targeted, quantitative mass spectrometry (MS), may overcome many of the limitations of previous approaches to the study of sarcopenia. Specific proteomic signatures, the overrepresentation or underrepresentation of certain proteins, or even altered stoichiometric relationships between proteins may provide clues to the processes that lead to muscle impairments, perhaps also allowing the diagnosis of different conditions, including age-related sarcopenia. For example, changes in mitochondrial proteins may suggest that scarcity of energy is at the root of muscle degeneration. Proteomics could also be used to monitor, at an early stage, the potential effectiveness of interventions aimed at normalizing muscle. This approach would add considerable value to the simple monitoring of muscle mass and strength. In fact, interventions aimed at expanding muscle mass, such as testosterone or selective androgen receptor modulators, have shown little impact on physical functions such as walking performance; the structural and biochemical reasons for such a discrepancy are not clear.[9, 10] On the other hand, modifications in the abundance of specific muscle proteins may occur long before any improvement in strength is clinically detectable. Overall, the complementary use of advanced high-sensitivity discovery-mode and targeted, quantitative MS/MS proteomic analysis of skeletal muscle specimens may provide new insights into skeletal muscle in health and disease through the characterization of protein abundance, quality, and diversity, including isoforms and post-translational modifications (PTMs). Application of proteomics may help to identify biomarkers of adaptive or maladaptive biological mechanisms manifested during ageing or diseases.
A critical step for making progress in this field is a comprehensive characterization and classification of the human skeletal muscle proteome. Because it is well known that muscles differ between men and women and change substantially with ageing, such normative data will have to be developed in men and women of different ages, preferably free of major conditions or treatments that may affect muscle proteins. Similarly, it is important to have an annotation system available that characterizes proteins in terms of their cellular localization and function.
We performed a search of the literature of published studies that had reported on skeletal muscle proteomic analyses in humans. The goal of this review is to summarize the state of the art in muscle proteomic research as the first step of a new initiative of the Biology/Disease-driven Human Proteome Project. Then, using data from available databases and an extensive review of the literature, we created a refined annotation system that improved our ability to classify muscle proteins based on their cellular localization and function. Finally, we outline what research should be performed to create a reference atlas of the normal, human skeletal muscle proteome.
Human skeletal muscle proteome
Caveats and challenges in performing and interpreting skeletal muscle proteomics
Human skeletal muscle has mainly been studied in muscle biopsies obtained under local anaesthesia. The percutaneous needle introduced by Bergstrom in 1962 yields useful biopsy material (around 100–300 mg) that has been used successfully for proteomic analysis conducted by MS.
Previous attempts to characterize the muscle proteome have been incomplete and mostly focused on muscular dystrophies.[15, 16] Although there are hundreds of skeletal muscles in the body, nearly all proteomic studies have focused on vastus lateralis, quadriceps, and deltoid muscles, and therefore, most of the current knowledge is skewed towards these muscles. This is problematic because different muscles may have quite distinct protein compositions, affected by differences in adipose and connective tissue content, degree of vascularization and innervation, and fibre type composition. Skeletal muscle fibres can be divided into three main subtypes distinguished by their contractile and metabolic properties: (A) slow, oxidative, type 1 fibres (measured by staining myosin-7, encoded by MYH7); (B) fast, oxidative, type 2A fibres (measured by myosin-2, encoded by MYH2); (C) fast, glycolytic, type 2x fibres (measured by myosin-1, encoded by MYH1). The relative percentage of fibre types tends to be muscle specific and is also affected by the overall level of physical activity and by metabolic diseases such as diabetes.[18-20] All these factors may affect protein synthesis and degradation. The large number of proteins and the wide dynamic range in abundance of different proteins in skeletal muscle are among the main challenges in performing high-fidelity proteomics. In fact, some of the largest (titin) and smallest (phospholamban) proteins in the human proteome are found in this tissue. In addition, sample preparation becomes critically important because the contractile proteins are so abundant that, without adequate fractionation, the chance of detecting low-abundance proteins, such as signalling enzymes, becomes low.
Conditions and factors affecting the composition of the human skeletal muscle proteome
Skeletal muscle not only contains proteins that pertain to muscle cells but also is unavoidably contaminated by proteins that pertain to adipose, connective, vascular, and neural tissue. To better understand the differences between specimens, proteins should be classified based on subcellular localization (e.g. contractile apparatus, nuclei, mitochondria, sarcoplasmic reticulum, transverse tubules, Golgi apparatus, sarcolemma, extracellular matrix, ribosomes, cytoskeleton, and the cytosolic fraction), function, and metabolic and signalling pathways (e.g. glucose metabolism, electron transport chain and OXPHOS system, and calcium homeostasis).[22-25] Early proteomic studies found that most of the proteins of skeletal muscle are localized in the sarcomere (e.g. myosins, actins, troponins, tropomyosins, and auxiliary proteins of sarcomeric units), representing about 55–60% of the total proteins.[18, 22, 26-28] Mitochondrial proteins constitute around 28% of the transcripts found in skeletal muscle, confirming the importance of muscle in energy metabolism. Of note, most of these proteins, especially sarcomeric, glycolytic, and mitochondrial proteins, exist in different isoforms (e.g. myosin has more than 20 isoforms), which strongly affects the functionality, integrity, and protein synthesis of skeletal muscle, especially during ageing and in some other pathological conditions.
During physical activity or movement, skeletal muscle is highly metabolically demanding; therefore, a very large number of proteins and enzymes related to all the metabolic pathways, as well as proteins involved in the cellular stress response such as heat shock proteins (HSPs), are prone to change in abundance during these states.[18, 30]
In addition, limited literature suggests that contractile, structural, and mitochondrial proteins, together with metabolic enzymes, can all be affected by conditions that affect muscle health (e.g. type 2 diabetes, obesity, ageing, age-related diseases, myopathies, and dystrophies) and by behavioural stress such as extreme exercise. Factors in liquid chromatography-mass spectrometry (LC-MS/MS) proteomic studies that have been found to affect muscle proteome composition are summarized in the next sections and in Table 1.
Age-related changes in the contractile apparatus are mainly the result of a transition from fast-twitch fibres (mainly IIx) to slow-twitch muscle fibres.[32-34] Proteomic studies have shown that myofibrillar and cytoskeletal proteins are less abundant in vastus lateralis in older compared with middle-aged women.[27, 35] Ageing also appeared to change the abundance of enzymes involved in oxidative metabolism, anaerobic metabolism, and the glycolysis pathway, such as adenosine triphosphate (ATP) synthase β-chain, acyl-CoA dehydrogenase, aconitase, 2-oxoglutarate dehydrogenase, malate dehydrogenase, creatine kinase, glycogen phosphorylase, and glyceraldehyde-3-phosphate dehydrogenase, among others. Other proteins that, like the HSPs, are involved in detoxification and cytoprotection functions (e.g. carbonic anhydrase 3, elongation factor 2, and ubiquitin-40S ribosomal protein S27a) decrease during ageing, with reduced scavenging of reactive oxygen species and accumulation of aggregated proteins.[30, 35, 36]
Exercise and physical inactivity
Although it is well known that skeletal muscle adaptations to exercise depend on duration, intensity, and frequency, changes in muscle proteins associated with different types of exercise have not been well characterized. Myofibrillar protein isoforms are most affected, mainly in the distribution of myosin heavy chain isoforms, thin filament isoforms, sarcoplasmic reticulum calcium proteins, mitochondrial proteins, and some chaperones involved in the cell stress response. Studies also found (i) changes in fibre type composition; (ii) changes in enzymes implicated in energy metabolism, such as the E3 subunit of the pyruvate dehydrogenase complex, fumarate hydratase, malate dehydrogenase, and aspartate aminotransferase; and (iii) changes in the expression of some components of the electron transport chain, such as complex I, complex IV, and ATP synthase subunits, although there is wide heterogeneity between studies in specific results.[38-41]
Prolonged bed rest is known to induce considerable atrophy in lower limb muscles accompanied by a reduction in cross-sectional area and fascicle length. Bed rest proteomic studies with or without exercise have shown muscle fibre atrophy accompanied by a deregulation in myosin heavy chain distribution, down-regulation of thin filament proteins, and changes in fibre type composition (increased type I and decreased type IIA fibres) in both vastus lateralis and soleus muscles. Moreover, these changes are usually accompanied by (i) down-regulation of proteins involved in aerobic metabolism (e.g. dihydrolipoyl dehydrogenase, succinyl-CoA ligase, and aconitate hydratase, among others); (ii) up-regulation of proteins involved in anaerobic glycolysis, such as some isoforms of glycogen phosphorylase and α-enolase; and (iii) down-regulation of the antioxidant defence system proteins.[43, 44]
Hypoxia and extreme environmental stress
The effects of hypoxia in skeletal muscle remain controversial. Early reports suggested that chronic hypoxia might positively affect muscle oxidative capacity and capillarization through up-regulation of hypoxia-inducible factor-1 alpha, the primary transcriptional response factor for hypoxic adaptation.[45, 46] However, continued exposure to hypoxia has negative effects on muscle structure, producing muscle atrophy.[47-49] Using a proteomic approach, it was found that under hypoxic conditions, vastus lateralis contractile proteins such as the γ-actin isoform were decreased together with other contractile proteins such as D repeat protein 1, vinculin, tubulin, Kelch repeat, and desmin. Furthermore, proteins involved in tissue injury hypoxia (mainly α-enolase), in oxidative metabolism (e.g. 2-oxoglutarate dehydrogenase, malate dehydrogenase, aconitate hydratase, and electron transport chain mitochondrial proteins), in fatty acid β-oxidation, creatine biosynthesis, and lipid peroxidation, and proteins involved in oxidative stress protection were also less abundant under hypoxic conditions. A proteomic study that aimed to elucidate the mechanisms by which Tibetans born and living at high altitude adapt to hypoxia found that the most differentially overexpressed proteins in vastus lateralis, compared with subjects living at low altitude, were those related to oxidative and glycolytic pathways, such as glutathione-S-transferase P1-1, Δ2-enoyl-CoA-hydratase, nicotinamide adenine dinucleotide (NADH)-ubiquinone oxidoreductase, myoglobin, glyceraldehyde-3-phosphate dehydrogenase, and lactate dehydrogenase, suggesting a metabolic switch to cope with these hypoxic conditions and reinforcing protection against reactive oxygen species tissue damage.
Conversely, changes in diet and environment, such as cold or hot conditions, can negatively impact muscle physiology.[52, 53] The increase in energy expenditure upon cold exposure or dietary changes is called adaptive thermogenesis. The exact mechanisms by which skeletal muscle adapts to thermogenesis remain unknown and may be related to changes in mitochondrial uncoupling. A study evaluating differences in the human skeletal muscle proteome under cold and overfeeding conditions found that extreme conditions produced higher levels of contractile proteins (including fast and slow isoforms) and metabolic enzymes, mainly related to glycolysis and fatty acid metabolism, suggesting that all these modifications were related to alterations in energy expenditure after cold exposure.
Metabolic diseases and other pathological conditions
Accumulation of adipose tissue that occurs with ageing and the metabolic syndrome, especially ectopic lipid deposition in sites such as enlarged visceral fat depots, the liver, skeletal muscle, and other non-adipose tissues, tends to be associated with loss of muscle mass as well as a decline in muscle quality. Obesity may negatively impact human skeletal muscle through accumulation of lipids such as long-chain diacylglycerols or ceramides, production of inflammatory mediators, interference with maintenance and repair mechanisms, and other still unknown mechanisms. Studies conducted on digitonin-extracted cytosolic proteins in rectus abdominis muscle from obese women have shown increased expression of metabolic pathway proteins such as adenylate kinase 1, glyceraldehyde-3-phosphate dehydrogenase, aldolase A, fatty acid-binding protein 3, and pyruvate kinase, and a decrease in carbonic anhydrase 3 compared with lean control subjects, suggesting that these differences represent a compensatory mechanism to counteract the muscle mitochondrial dysfunction associated with obesity already described in other studies.[23, 58]
Type 2 diabetes
There is strong evidence from epidemiological studies that people with insulin resistance or type 2 diabetes experience accelerated decline of muscle strength and muscle mass, although the reason for this association is not clear. Defects in the activity of the metabolic enzymes and in the insulin signalling pathways associated with decreased muscle glycogen synthesis have been described in the muscle of these patients.[60, 61] Proteomic studies of human skeletal muscle have shown that proteins involved in the tricarboxylic acid cycle (TCA cycle), mitochondrial respiration, and other mitochondrial functions such as ATP synthase β and α isoform and creatine kinase brain isoform, among others, were less abundant in vastus lateralis from insulin-resistant subjects, suggesting an increased glycolytic and decreased mitochondrial protein abundance together with a shift in muscle properties towards a fast-twitch pattern in the absence of marked changes in fibre-type distribution.[58, 62] Proteins involved in cytoskeletal structure and function like the actin capping and Z-disc component and the four isoforms of α-1 chain of type VI collagen, isoenzymes of myosin regulatory light chain 2A and B, myosin 15, and troponin isoforms were significantly altered together with some chaperone and co-chaperone proteins.[62, 63]
Exercise is beneficial in patients with type 2 diabetes and plays a major role in the prevention and control of insulin resistance. Proteomic studies have shown that a 4-week training programme of endurance exercise in patients with type 2 diabetes increased proteins in vastus lateralis involved in energy metabolism via glycolysis, the TCA cycle, the electron transport chain, and β-oxidation, mainly lactate dehydrogenase B chain, mitochondrial creatine kinase, aspartate aminotransferase, acetyl-CoA acetyltransferase, cytochrome b-c1 complex subunit 2, isoform 1 of succinyl-CoA ligase (ADP-forming) subunit beta, the cytoplasmic isoform of fumarate hydratase, and isocitrate dehydrogenase. There was also a reduction in the levels of glycolytic proteins, including isoform 1 of 6-phosphofructokinase, fructose-bisphosphate aldolase A, and lactoylglutathione lyase (glyoxalase 1).
Specific muscle diseases and neuromuscular disorders
Proteomic studies performed in skeletal muscle from patients with skeletal muscle diseases such as hereditary inclusion body myopathy, myofibrillar myopathies, desminopathies, and some muscle disorders caused by collagen VI mutations have shown that the most abundant changes are found in proteins related to cytoskeleton and sarcomere functions, basically caused by mutations in genes encoding sarcomeric and extra-sarcomeric proteins.[66-70] These changes affect not only the skeletal muscle but also its connective tissues, producing contractures and muscle weakness.
Similar results have been found in patients with amyotrophic lateral sclerosis, a severe and fatal neurodegenerative disease characterized by the selective death of motor neurons in the motor cortex, brainstem, and spinal cord.[71, 72] The pathogenesis of this disease is still unknown, but it has been suggested that skeletal muscle could play an important pathogenic role.[71, 73]
Post-translational modifications in skeletal muscle proteome
Post-translational modifications are chemical alterations to protein structure, typically catalysed by substrate-specific enzymes, that may change the properties of a protein by proteolytic cleavage or by addition of a modifying group to one or more amino acids. More than 300 different types of PTMs are known, and many are implicated in conditions such as cancer, ageing, and age-related diseases.[75, 76] Protein phosphorylation and acetylation play key roles in signal transduction pathways regulating energy metabolism, contractile function, and muscle mass in human skeletal muscle, but their role in normal skeletal muscle and in skeletal muscle disorders remains poorly understood because of technical limitations related to both the stability of PTMs in collected biopsies and the limits of proteomic technology (see succeeding text).
In the first global analysis of the in vivo phosphoproteome of human skeletal muscle from vastus lateralis in healthy subjects, 367 phosphorylation sites in 144 phosphoproteins/phosphoprotein groups were described. More than one-quarter were sarcomeric proteins from the contractile apparatus, such as thin actin-containing and thick myosin-containing filaments, and M-line and Z-disc-associated proteins. Other phosphorylation sites were identified in some enzymes of glycogen metabolism and in some kinase and phosphatase subunits that regulate the phosphorylation of glycogen synthase and phosphorylase.[22, 77] In mitochondria isolated from vastus lateralis, other authors identified 155 distinct phosphorylation sites in 77 mitochondrial phosphoproteins mainly involved in apoptosis, oxidative phosphorylation (the most abundant), the TCA cycle, fatty acid transporters and β-oxidation, amino acid degradation, import machinery and transporters, and calcium homeostasis. The authors suggested that reversible phosphorylation of mitochondrial proteins may play a role in the mitochondrial dysfunction that is found in different pathologies such as muscle disorders, type 2 diabetes, and age-related diseases.
In another study, the same investigators examined the effects of 4 h stimulation with physiological levels of insulin on the phosphorylation of mitochondrial proteins in human skeletal muscle in vivo in healthy individuals. They identified 207 phosphorylation sites, of which 45% were identified both in basal and insulin-stimulated samples, whereas almost half of them were identified exclusively in the insulin-stimulated samples. The most abundant phosphoproteins were those involved in oxidative phosphorylation and the TCA cycle. Furthermore, they found multiple phosphorylation sites in components of the mitochondrial inner membrane organizing system. The authors concluded that these results suggest that insulin might regulate the phosphorylation of mitochondrial proteins in skeletal muscle. Also, a recent phosphoproteomic study in human skeletal muscle showed that some 5' adenosine monophosphate activated protein kinase (AMPK) substrates regulate mitochondrial respiration in response to exercise via phosphorylation.
Finally, another proteomic study comparing skeletal muscle from obese individuals with or without type 2 diabetes found significant differences in phosphorylation and ubiquitination in some of the most important glycolytic enzymes, such as glycogen phosphorylase muscle isoform, β-enolase, and pyruvate kinase 2, together with other contractile and cytoskeleton proteins such as glyceraldehyde 3-phosphate dehydrogenase, myosin regulatory light chain 2, ventricular/cardiac muscle, and myosin regulatory light chain 2 skeletal muscle.
Compilation of the current human skeletal muscle proteome
In order to compile a comprehensive atlas of the skeletal muscle proteome, we searched PubMed using the term ‘skeletal muscle’ combined with ‘proteome’, ‘proteomics’, ‘MS’, or ‘LC-MS/MS’, using a filter for ‘human’ and restricting results to papers published before December 2015. Papers were excluded if they involved proteomic investigations of human cell lines or if low-abundance proteins were detected using immunological methods rather than proteomic and MS approaches. Thirty-eight publications fit the aforementioned criteria and were used for compiling the human skeletal muscle proteome (Figure 1). The Uniprot database was used as the authoritative reference for human proteins. Lists of proteins were compiled for each paper based upon protein data in the published tables and/or supplementary tables. In order to combine all protein identifications into one list, International Protein Index accessions and National Center for Biotechnology Information protein accessions were converted to Uniprot accessions by Uniprot ID conversion. Several classes of proteins were removed from the resulting set. Proteins that had been merged into other accession numbers or removed from their respective databases were considered obsolete and removed. Proteins that were duplicated between identifications were reduced to a single identification. Finally, obvious contaminants, namely decoy proteins, trypsin, and non-human proteins, were removed. The final filtered list contained 5431 proteins identified for the human skeletal muscle proteome. We also identified the top 20 most reported skeletal muscle proteins from the 38 publications (Figure 2).
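The merging and filtering steps described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the procedure actually used for the compilation: the function name, the contaminant prefixes, the mapping table, and all accession strings are hypothetical placeholders, and in practice the accession mapping would come from an ID conversion service rather than a hand-written dictionary.

```python
from typing import Dict, Iterable, List, Set

# Hypothetical prefixes marking decoy/contaminant entries; real search
# engines and contaminant lists use their own conventions.
CONTAMINANT_PREFIXES = ("DECOY_", "REV_", "CON_")


def compile_proteome(per_paper: Dict[str, Iterable[str]],
                     to_uniprot: Dict[str, str],
                     obsolete: Set[str],
                     non_human: Set[str]) -> List[str]:
    """Merge per-paper accession lists into one non-redundant list."""
    merged: Set[str] = set()  # a set collapses duplicate identifications
    for accessions in per_paper.values():
        for acc in accessions:
            acc = acc.strip()
            # Skip empty cells and flagged decoy/contaminant entries.
            if not acc or acc.upper().startswith(CONTAMINANT_PREFIXES):
                continue
            # Convert legacy accessions; unmapped ones are kept as-is.
            uniprot = to_uniprot.get(acc, acc)
            # Drop obsolete entries and non-human proteins (e.g. trypsin).
            if uniprot in obsolete or uniprot in non_human:
                continue
            merged.add(uniprot)
    return sorted(merged)


# Toy example with made-up accession strings.
papers = {
    "paper_a": ["IPI_TOY_1", "UP_TOY_A", "DECOY_UP_TOY_A"],
    "paper_b": ["UP_TOY_A", "UP_TOY_TRYPSIN", "UP_TOY_OLD"],
}
print(compile_proteome(papers,
                       to_uniprot={"IPI_TOY_1": "UP_TOY_B"},
                       obsolete={"UP_TOY_OLD"},
                       non_human={"UP_TOY_TRYPSIN"}))
# prints ['UP_TOY_A', 'UP_TOY_B']
```

Using a set for the merged list makes the removal of duplicate identifications implicit, so the order in which papers are processed does not affect the result.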
The neXtProt database is a recently developed database that incorporates only high-quality, well-studied proteins as the human proteome. Therefore, we used neXtProt to validate our final human skeletal muscle proteome. In our list of 5431 proteins, 4947 had neXtProt accessions (Table S1). There were 484 proteins without neXtProt accessions (Table S1, sheet 2) as of December 2015. The high rate of concordance between neXtProt and our dataset suggests that our filtering method was appropriate and provided a fairly comprehensive collection of the proteome.
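As a quick arithmetic check, the concordance implied by the counts reported above works out as follows (the percentages are derived here from those counts and are not stated in the source tables):

```latex
\[
\frac{4947}{5431} \approx 0.911 \;(91.1\%), \qquad
\frac{484}{5431} \approx 0.089 \;(8.9\%), \qquad
4947 + 484 = 5431 .
\]
```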
Many of the earlier studies of the human skeletal muscle proteome involved two-dimensional gel electrophoresis and matrix-assisted laser desorption/ionization-time of flight (MALDI-TOF), a workflow that is mainly useful to identify proteins that are differentially expressed between groups.[25, 33, 36, 51, 58, 62] In compiling this list of human skeletal muscle proteins from those manuscripts, we encountered several problems: (i) papers based on earlier technology may have been limited by ambiguous identification of proteins; (ii) most studies did not state the precise criteria used to identify proteins, including false discovery rates (FDR) for peptides and proteins; (iii) some reported isoforms have no accompanying peptide data, so it is not possible to determine whether a peptide amino acid sequence that is unique to the specific isoform was actually observed; (iv) most studies did not provide spectra and did not have raw MS data available in a public repository for further inspection and corroboration; and (v) a few studies did not even specify which skeletal muscle was biopsied for study. Some studies used an FDR for proteins >1%, which by current standards is not considered to be sufficiently rigorous. For example, the Plasma Peptide Atlas uses a stringent 1% FDR for proteins (which corresponds to a 0.2% FDR at the peptide level). If such stringent criteria were applied to the human skeletal muscle proteome, as compiled in the present paper, the number of identifications would be anticipated to be lower than the 5431 non-redundant proteins in Table S1. However, in certain cases, an investigator may be justified in allowing an FDR of up to 5%, in particular for older mass spectrometers. Furthermore, we also note that the question of the correct method of calculating FDR is not completely settled.[84, 85] Therefore, for our list, we accepted identifications reported at the thresholds used in the published datasets.
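To make the effect of an FDR threshold concrete, the sketch below uses one common estimator, the ratio of decoy to target matches above a score cut-off. This is an assumption chosen for illustration: the reviewed studies did not necessarily use this estimator, and, as noted above, the correct method is not settled. The function names and the toy scores are hypothetical.

```python
from typing import List, Optional, Tuple

# Each match is (score, is_decoy); higher scores mean better matches.
Match = Tuple[float, bool]


def fdr_at_threshold(matches: List[Match], threshold: float) -> float:
    """Estimate FDR as decoys / targets among matches scoring >= threshold."""
    targets = sum(1 for s, d in matches if s >= threshold and not d)
    decoys = sum(1 for s, d in matches if s >= threshold and d)
    if targets == 0:
        return 0.0 if decoys == 0 else float("inf")
    return decoys / targets


def lowest_threshold_for_fdr(matches: List[Match],
                             max_fdr: float) -> Optional[float]:
    """Return the lowest observed score whose estimated FDR is <= max_fdr."""
    for threshold in sorted({s for s, _ in matches}):
        if fdr_at_threshold(matches, threshold) <= max_fdr:
            return threshold
    return None  # no cut-off satisfies the requested FDR


# Toy data: one decoy hit among five matches.
toy = [(10.0, False), (9.0, False), (8.0, True), (7.0, False), (6.0, False)]
print(fdr_at_threshold(toy, 6.0))           # prints 0.25
print(lowest_threshold_for_fdr(toy, 0.01))  # prints 9.0
```

In this toy case, tightening the accepted FDR from 25% to 1% reduces the accepted target matches from four to two, which mirrors why a stricter threshold would be expected to shrink the compiled list.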
In the future, the redundancy of protein identifications and quality of reporting of peptide data from the human skeletal muscle could be improved with standards established by the human skeletal muscle proteome initiative.
Finally, we compared our proteome to three other datasets relevant to the skeletal muscle proteome: a non-MS, antibody-based, global human proteome, called the Human Protein Atlas[28, 86, 87]; a mouse proteomics dataset from skeletal muscle[88, 89]; and an RNA-seq dataset produced from human skeletal muscle from 361 individuals, from the Genotype-Tissue Expression (GTEx) project. While we are interested in the MS-based human proteome, transcript data and antibody-based protein data can still inform the analysis of proteomics data, and mouse skeletal muscle is also a useful comparison. In order to translate between transcript and protein, and between mouse and human, we compared all three skeletal muscle studies at the gene level. In mouse skeletal muscle, there were 9887 genes reported from 10 218 proteins, while the Human Protein Atlas retrieved 5389 skeletal muscle coding genes, and GTEx contained 7077 skeletal muscle-specific genes. To reduce the complexity of analysis, we combined the transcriptome data from the Human Protein Atlas and GTEx into a set of 10 962 genes. Our compiled skeletal muscle proteome corresponds to 5024 genes for 5431 proteins. The comparison of these datasets shows that there is likely more of the proteome to be mapped in human skeletal muscle (Figure 3).
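A gene-level comparison of this kind reduces to set operations once every dataset is expressed as a set of gene symbols. The sketch below is a minimal illustration under that assumption; the function name and the toy gene sets are hypothetical and are not the data behind Figure 3.

```python
from typing import Dict, Set


def compare_gene_sets(reference: Set[str],
                      others: Dict[str, Set[str]]) -> Dict[str, Dict[str, int]]:
    """Count shared and dataset-specific genes against a reference set."""
    summary: Dict[str, Dict[str, int]] = {}
    for name, genes in others.items():
        summary[name] = {
            "shared": len(reference & genes),          # in both datasets
            "only_reference": len(reference - genes),  # reference only
            "only_other": len(genes - reference),      # candidates still unmapped
        }
    return summary


# Toy gene sets for illustration only.
compiled = {"ACTA1", "MYH7", "TTN"}
datasets = {
    "transcriptome": {"ACTA1", "MYH7", "TTN", "PLN", "DES"},
    "mouse": {"ACTA1", "TTN", "VCL"},
}
for name, counts in compare_gene_sets(compiled, datasets).items():
    print(name, counts)
```

Genes counted under `only_other` are the ones that would point to parts of the proteome not yet captured in the compiled list.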
Functional annotation of skeletal muscle proteome
To describe the skeletal muscle proteome, we characterized the 5431 compiled proteins based on gene ontology (GO), UniProt keywords, and manual annotation. In particular, we annotated proteins by UniProt keywords when they were available. When the UniProt keyword was not available, we used the GO annotations and then manually searched the literature for an exact annotation. For all proteins with multiple annotations, the literature was consulted, and the protein was manually annotated according to its localization and function in skeletal muscle. Protein numbers were normalized to the total number of ontology annotation terms identified for all the proteins. The justification for this annotation process is as follows: GO provides a set of hierarchical controlled vocabularies split into three categories: cellular component, molecular function, and biological process. However, GO is designed to be species neutral, an approach that tends to describe very broad functional domains.[91, 92] UniProt contains a controlled vocabulary that is specific to species. Therefore, to make the annotation more relevant to human biology, we combined the GO annotations with UniProt keyword annotation and direct literature search.
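The order of precedence described above (UniProt keyword first, then GO, then manual literature annotation) can be sketched as a simple fallback lookup. The function name, the dictionary inputs, and the protein identifiers `P1` to `P4` are hypothetical and serve only to illustrate the logic; this is not the pipeline that was actually used.

```python
def annotate(protein_id, uniprot_keywords, go_terms, manual):
    """Return (source, annotations) using the first source that has an entry."""
    if uniprot_keywords.get(protein_id):
        return "uniprot_keyword", uniprot_keywords[protein_id]
    if go_terms.get(protein_id):
        return "go", go_terms[protein_id]
    if manual.get(protein_id):
        return "manual", manual[protein_id]
    # No information available from any source
    return "undetermined", []


# Hypothetical annotation tables keyed by placeholder protein identifiers
uniprot_keywords = {"P1": ["mitochondrion"]}
go_terms = {"P1": ["mitochondrion"], "P2": ["nucleus"]}
manual = {"P3": ["sarcomere"]}

annotations = {
    pid: annotate(pid, uniprot_keywords, go_terms, manual)
    for pid in ["P1", "P2", "P3", "P4"]
}
```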
By this process, we were able to annotate 5031 proteins over 7934 terms for the cellular component, 4316 proteins over 5795 terms for molecular function and, finally, 4395 proteins over 6501 terms for biological process. One hundred and three serum proteins were excluded from both molecular function and biological process. Some proteins are counted multiple times because they are found in several compartments, have different molecular functions, or are involved in multiple biological processes.
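Because proteins can carry more than one term, the number of terms exceeds the number of proteins in each category. A short calculation using only the counts reported above makes the degree of multiple counting explicit; the variable names are illustrative.

```python
# (annotated proteins, annotation terms) as reported in the text
annotation_counts = {
    "cellular component": (5031, 7934),
    "molecular function": (4316, 5795),
    "biological process": (4395, 6501),
}

# Average number of annotation terms per annotated protein in each category
terms_per_protein = {
    category: terms / proteins
    for category, (proteins, terms) in annotation_counts.items()
}

for category, ratio in terms_per_protein.items():
    print(f"{category}: {ratio:.2f} terms per protein")
```

On average, each annotated protein therefore carries roughly 1.3 to 1.6 terms, depending on the category.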
Classification of the human skeletal muscle proteome based on the cellular components
The Gene Ontology Consortium (GOC) defines the cellular component as ‘locations, at the levels of subcellular structures and macromolecular complexes’. Based on this definition, we classified 5031 proteins. We found that the majority of the skeletal muscle proteins are located in four major compartments: cytoplasm (~30%), cell membrane (~24%), nucleus (~19%), and mitochondria (~11%) (Figure 4A). Membrane proteins include nuclear membrane, cell membrane, inner and outer mitochondrial membrane, matrix membrane, and Golgi membrane. Only a small percentage (7%) of the proteins could not be assigned to any compartment (no information available), and we labelled them ‘undetermined’. Of note, a high percentage of proteins (95%) can be located in multiple compartments (data not shown).
Classification of the human skeletal muscle proteome based on the molecular function
GOC defines molecular function as ‘Elemental activities, such as catalysis or binding, describing the actions of a gene product at the molecular level’. Based on this definition, we were able to define the molecular function of 4316 proteins over 5795 annotated terms. Specifically, we identified 18 different major molecular functions (Figure 4B). In brief, 40% of the proteins have an enzymatic function and almost 56% (~1300 proteins) do not have any molecular functional details. Interestingly, about one-half of the proteins identified in the muscle have regulatory, structural, and signalling functions, underscoring the complexity of the muscle as a metabolic and secretory organ whose role is not limited to energy metabolism. In fact, the TCA cycle, electron transport respiratory chain, and ATP synthesis constitute only 3.1% of the skeletal muscle proteome (Figure 5A).
The highest percentage of molecular function annotations is for enzymes (Figure 5B). Changes in muscle enzyme levels might be a signal of muscle damage or may underlie some myopathies or other defective processes during ageing, but there is not enough information in the literature on this topic.
Classification of the human skeletal muscle proteome based on biological processes
Finally, we classified the skeletal muscle proteome based on biological processes. GOC defines the biological process as ‘a recognized series of events or molecular functions with a defined beginning and end’. Because we aimed for a parsimonious approach, and because we found around 80 different biological processes described, we summarized them into 15 major biological processes, namely, (i) ‘transport process’ that includes protein transport, lipid transport, ion transport, electron transport, endocytosis, and exocytosis; (ii) ‘cell cycle/process’ that includes cell shape, cell adhesion, cell growth, chemotaxis, chemoattraction, cell proliferation, apoptosis, angiogenesis, cell killing, autophagy, cell division, mitosis, and meiosis; (iii) ‘metabolism’ that combines amino acid, carbohydrate, ketone body, fatty acid, glycogen, glycan, lipid, protein, nitrogen compound, and aldehyde metabolism; (iv) ‘bioenergetics process’ that includes glycolysis, Krebs cycle, pentose shunt, urea cycle, and ATP synthesis (Figures 4C and 5A). Based on this, we were able to annotate 4395 proteins over 6501 annotated terms.
Future directions and gaps in knowledge
By performing a comprehensive review of the literature, we have shown that there are at least 5431 different proteins in human skeletal muscle. In spite of the extensive work done so far, our knowledge is still too limited to translate skeletal muscle proteomics into a powerful clinical tool. First of all, there are >200 muscles in the human body, and most of the data available are based on biopsies collected almost exclusively from the quadriceps and deltoid muscles. It is known that there are considerable differences between muscles in protein and enzymatic composition based on the muscle fibre type composition; therefore, some muscles will be mainly glycolytic and involved in fast contractions, and others will be primarily aerobic, with a higher oxidative capacity. Thus, the list of proteins reported here may not apply to other specialized muscle types, such as laryngeal muscles and extra-ocular muscles in humans. Also, skeletal muscle contains an abundance of different proteoforms that may play an important role in muscle function and could affect the composition of the skeletal muscle proteome. In addition, normative proteomic data on muscle in disease-free men and women dispersed over a wide age range, assessed by rigorous proteomic and MS approaches and state-of-the-art technology, are not available. The absence of a solid reference atlas of skeletal muscle proteins makes the identification of changes in the abundances of proteins specifically associated with different diseases or conditions a formidable task.
Most of these obstacles are not insurmountable. For example, vastus lateralis biopsies are being performed in the Baltimore Longitudinal Study of Aging, America's longest-running scientific study of human ageing, in participants dispersed over a wide age range who are selected to be healthy according to very strict inclusion and exclusion criteria, which will increase the likelihood of producing reliable and reproducible proteomic results.
Significant technological progress in MS over the last decade has substantially improved the quality of proteomics data in fundamental biological and pathophysiological research. Proteomic studies have become a valuable asset for global tissue-specific or cell-specific protein profiling, biomarker discovery, and quantitative proteome studies. Enhanced mass accuracy, acquisition rate, and sensitivity on some of the newly released instruments, in combination with advanced liquid chromatography methods, allow large amounts of information to be recorded and peptides to be identified accurately at a very high confidence level. Thousands of proteins can be identified in a single shotgun experiment in a robust and highly reproducible manner using simplified, accurately controlled sample preparation protocols. Extremely complex biological samples with challenging dynamic ranges, such as skeletal muscle tissues or human plasma, can be analysed in depth by employing advanced offline high-performance liquid chromatography/ultra-performance liquid chromatography (HPLC/UPLC) protein or peptide fractionation methods, including ion exchange, basic reverse phase, hydrophilic interaction (HILIC), in-solution isoelectric focusing (IEF), or traditional 1D and 2D gel electrophoresis techniques, with dramatically increased yield in protein discovery, allowing detection of functionally important low-abundance proteins (the ‘deep proteome’ analysis).[93-96] Furthermore, improved sample preparation and fractionation techniques, including isolation of organelles such as mitochondrial, nuclear, or membrane fractions, as well as immunoprecipitation (IP) of tyrosine-phosphorylated proteins, pull-down of, for example, 14-3-3 proteins, and pull-down of interaction partners of specific signalling enzymes, for example, IRS-1, can also be used to target highly specialized organelle proteomes and unique protein complexes in various cell models at different biological states.
Quantitative proteomic methods are also evolving rapidly alongside improving instrument quality, merging discovery and quantitative proteomic studies into one unified analysis. The recently proposed SWATH™ technology, an open-window acquisition technique available on some triple TOF instruments, and data-independent acquisition (DIA) protocols specific to some Orbitrap instruments are used today as powerful alternatives to some of the traditional peptide-labelling techniques such as isobaric tag labelling (iTRAQ or TMT).[98, 99] Peptide-labelling or protein-labelling techniques are used broadly in proteomic studies for relative and absolute protein quantification in the analysis of complex cell or tissue lysates. Targeted methods in proteomics such as selected/multiple reaction monitoring are widely used as accurate quantitative assays and also as additional validation techniques in discovery studies.[100, 101]
Methods and instruments are also improving for the detection of PTMs such as phosphorylation, glycosylation, acetylation, ubiquitination, sumoylation, and so on. The observation that acetylation and phosphorylation are often identified on the same peptides and proteins, and probably interact, is an interesting new area of research. PTMs are important to study because they reflect the diversity of protein function.[73, 102] Many PTMs are difficult to study because they are labile during sample processing and prone to LC-MS artefacts. For example, O-GlcNAcylation, an important PTM that rivals phosphorylation in abundance and distribution and has been implicated in chronic diseases, has been especially challenging to detect and measure. Advancements in this field are due, at least in part, to the steep improvement of bioinformatics tools, better search algorithms, and better protein reference databases. Many proteins have functions that are unknown or not well understood. By studying the proteins with which a particular protein interacts, it is possible to deduce biological functions and pathways using an in silico structure-based approach. Protocols have recently been developed for proteomic analysis of formalin-fixed and paraffin-embedded tissues that yield proteome inventories equivalent to those obtained from frozen tissue specimens. Thus, these tissues could also be a valuable resource for retrospective biomarker discovery studies.
A single technique or method alone is not enough to define or characterize a proteome. Because one of the most challenging issues for the scientific community nowadays is the identification of biological markers, all these advances in the proteomic field together with high-throughput technologies such as genomics, metabolomics, and other ‘omics’ will allow us to identify molecular pathways and novel protein–protein interactions that are involved not only in pathological conditions but also in normal human biology.
Cluster analysis is a statistical approach used in microarray research that identifies genes within a cluster that are more similar to each other than to genes contained in different clusters. By grouping genes that exhibit similarities in their expression patterns, the function of genes that were previously uncharacterized may be revealed. There are two groups of clustering methods, hierarchical and non-hierarchical. Non-hierarchical algorithms require that the number of clusters (k) be pre-specified. They can be run multiple times with different values of k, and the user can then choose the clustering solution that is most sensible for the problem of interest.
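As an illustration of a non-hierarchical method run for several pre-specified values of k, the following is a minimal sketch of k-means (Lloyd's algorithm) in plain Python. The choice of k-means, the function names, the random seeding, and the eight two-dimensional toy points are all assumptions made for illustration; they are not drawn from any dataset or method evaluated in the text.

```python
import random


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, n_iter=50, seed=0):
    """Lloyd's algorithm with random initial centroids; returns (labels, centroids)."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest centroid
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments no longer change
        labels = new_labels
        # Update step: move each centroid to the mean of its members
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


def within_ss(points, labels, centroids):
    """Total within-cluster sum of squared distances."""
    return sum(squared_distance(p, centroids[lab]) for p, lab in zip(points, labels))


# Toy data: two well-separated groups of four points each
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]

# Run the algorithm for several pre-specified values of k
results = {}
for k in (1, 2, 3):
    labels, centroids = kmeans(points, k)
    results[k] = (labels, within_ss(points, labels, centroids))
```

Comparing the within-cluster sum of squares across values of k is one common, informal way to choose among the solutions; it does not by itself establish that the chosen clusters are stable.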
If we consider each gene as a point in high dimensional space, then "clusters may be described as continuous regions of this space containing a relatively high density of points, separated from other such regions by regions containing a relatively low density of points. Clusters described in this way are sometime referred to as natural clusters".
Despite the use of cluster analysis in microarray research, the evaluation of the "validity" of a cluster solution has been challenging. This is due, in part, to the properties of cluster analysis. Cluster analysis has no null hypothesis to test and hence no right answer, which makes the testing of the validity of specific solutions, algorithms, and procedures difficult. A second challenge encountered is that genes may not "naturally" fall into clusters separated by empty areas of the attribute space in genome expression studies. Hence, genome-wide collections of expression trajectories may lack a "natural clustering" structure in many cases. Third, the result of gene clustering may be "method sensitive". That is, gene clustering depends on several methodological choices, including the distance metric used, the clustering algorithm, and the stopping rule in the case of iterative partitioning methods. Hence, it is important to evaluate the stability of any specific derived cluster solution and the general performance of clustering approaches.
According to McShane et al., "Clustering algorithms always detect clusters, even in random data and it is imperative to conduct some statistical assessments of the strength of evidence for any clustering and to examine the reproducibility of individual clusters". Roth et al. defined stability as "the variability of solutions which are computed from different data sets sampled on the same source". It has been noted that a replicable classification is not necessarily a useful one, but a useful one that characterizes some aspect of the population must be replicable. The concept of a replicable cluster is defined as reproducible across multiple samplings from the same population. Thus, some methodologists have suggested that the validity of clustering methods could be defined as the extent to which they yield classifications that are reproducible beyond chance levels. Most recently, Tseng et al. identified stability of clusters in a sequential manner through an analysis of the tendency of genes to be grouped together under repeated resampling. Famili et al. summarized the related work as follows:
Zhang et al. proposed a parametric bootstrap re-sampling method (PBR) to incorporate information on variations in gene expression levels to assess the reliability of gene clusters identified from large-scale gene expression data...Smolkin et al. assessed the stability of a cluster using their Cluster Stability Score, by which a cluster's stability is calculated through clustering on random subspace of the attribute space...Ben-Hur et al. proposed a stability-based re-sampling method for estimating the number of clusters, where stability is characterized by the distribution of pair-wise similarities between clusters obtained from sub-samples of the data...Datta et al. formulated 3 other validation measures using the left-out-one condition strategy to evaluate the performances of 6 clustering algorithms...Giurcaneanu et al. introduced a stability index to estimate the quality of clusters for randomly selected subsets of the data.
Clusters that produce classifications with greater replicability would be considered more valid. The objective of this paper is to determine the performance of commonly used non-hierarchical clustering algorithms and the degree of stability achieved using several microarray datasets.
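The general idea of replicability under resampling can be sketched as follows: cluster repeated random subsamples and record how often each pair of items is placed in the same cluster. This is a generic illustration and does not reproduce any of the published methods cited above. The k-means variant, the farthest-first seeding (chosen here only to keep the example deterministic), the function names, and the toy points are all assumptions.

```python
import random
from itertools import combinations


def sqdist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def farthest_first(points, k):
    """Deterministic seeding: first point, then the point farthest from all seeds."""
    centroids = [list(points[0])]
    while len(centroids) < k:
        nxt = max(points, key=lambda p: min(sqdist(p, c) for c in centroids))
        centroids.append(list(nxt))
    return centroids


def kmeans_labels(points, k, n_iter=50):
    centroids = farthest_first(points, k)
    labels = None
    for _ in range(n_iter):
        new_labels = [
            min(range(k), key=lambda c: sqdist(p, centroids[c])) for p in points
        ]
        if new_labels == labels:
            break
        labels = new_labels
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def co_clustering_frequency(points, k, n_resamples=50, fraction=0.75, seed=0):
    """For each pair of items, the share of subsamples containing both items
    in which the two were assigned to the same cluster."""
    rng = random.Random(seed)
    n = len(points)
    size = max(k, int(fraction * n))
    together, sampled = {}, {}
    for _ in range(n_resamples):
        idx = sorted(rng.sample(range(n), size))  # subsample without replacement
        labels = kmeans_labels([points[i] for i in idx], k)
        for (a, la), (b, lb) in combinations(list(zip(idx, labels)), 2):
            sampled[(a, b)] = sampled.get((a, b), 0) + 1
            together[(a, b)] = together.get((a, b), 0) + int(la == lb)
    return {pair: together[pair] / sampled[pair] for pair in sampled}


# Toy data: two well-separated groups (indices 0-3 and 4-7)
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
freq = co_clustering_frequency(points, k=2)
```

Frequencies close to 1 or 0 indicate pairs that are consistently grouped together or apart across subsamples, whereas intermediate values flag assignments that depend on which items happened to be sampled.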