Scholarly article on topic 'Detection of missing proteins using the PRIDE database as a source of mass-spectrometry evidence'


Detection of missing proteins using the PRIDE database as a source of mass-spectrometry evidence

Alba Garin, Leticia Odriozola, Ana Martinez-Val, Noemí del Toro, Rocío Martínez, Manuela Molina, Laura Cantero, Rocío Rivera, Nicolás Garrido, Francisco Domínguez, Manuel M Sánchez del Pino, Juan Antonio Vizcaíno, Fernando Jose Corrales, and Victor Segura

J. Proteome Res., Just Accepted Manuscript • DOI: 10.1021/acs.jproteome.6b00437 • Publication Date (Web): 01 Sep 2016



Alba Garin-Muga, Leticia Odriozola, Ana Martínez-Val, Noemí del Toro, Rocío Martínez, Manuela Molina, Laura Cantero, Rocío Rivera, Nicolás Garrido, Francisco Domínguez, Manuel M. Sánchez del Pino, Juan Antonio Vizcaíno, Fernando J. Corrales, and Victor Segura*

Proteomics Unit, Spanish National Cancer Research Centre, Madrid, Spain

European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, UK

Proteomics Unit (SCSIE), University of Valencia, Valencia, Spain

Andrology Laboratory and Sperm Bank, Instituto Universitario IVI, Valencia, Spain

Fundacion IVI/INCLIVA, Valencia, Spain

Biochemistry department, University of Valencia, Valencia, Spain

Division of Hepatology and Gene Therapy, Center for Applied Medical Research, University of Navarra, Pamplona, Spain

Proteomics and Bioinformatics Unit, Center for Applied Medical Research, University of Navarra, Pamplona, Spain

IdiSNA, Navarra Institute for Health Research, Pamplona, Spain

E-mail: vsegura@unav.es

Abstract

The current catalogue of the human proteome is not yet complete, as experimental proteomics evidence is still elusive for a group of proteins known as the missing proteins. The Human Proteome Project (HPP) has been successfully using technology and bioinformatic resources to improve the characterization of such challenging proteins. In this manuscript, we propose a pipeline starting with the mining of the PRIDE database to select a group of datasets potentially enriched in missing proteins that are subsequently analysed for protein identification with a method based on the statistical analysis of proteotypic peptides. Spermatozoa and the HEK293 cell line were found to be a promising source of missing proteins and clearly merit further attention in future studies. After the analysis of the selected samples we found 342 PSMs suggesting the presence of 97 missing proteins in human spermatozoa or the HEK293 cell line, while only 36 missing proteins were potentially detected in retina, frontal cortex, aorta thoracica or placenta. The functional analysis of the missing proteins detected confirmed their tissue specificity, and the validation of a selected set of peptides using targeted proteomics (SRM/MRM assays) further supports the utility of the proposed pipeline. As illustrative examples, DNAH3 and TEPP in spermatozoa, and UNCX and ATAD3C in HEK293 cells were some of the more robust and remarkable identifications in this study. We provide evidence indicating the relevance of carefully analysing the ever-increasing MS/MS data available from PRIDE and other repositories as sources for missing protein detection in specific biological matrices, as revealed for HEK293 cells.

Keywords

C-HPP, missing proteins, MS/MS proteomics, PRIDE database


Introduction

The Human Proteome Project (HPP)1 is an international project to characterize the human proteome through two programs: a chromosome-based strategy (C-HPP) designed in 20102,3 and the biology/disease-driven strategy (B/D-HPP).4,5 Researchers from the chromosome-based strategy have used state-of-the-art high-throughput proteomics technology, but major difficulties have arisen in the detection of a set of proteins, the so-called "missing proteins".6-8 These proteins lack experimental evidence obtained by mass spectrometry or antibody-based techniques, and their existence is based on bioinformatic predictions or transcriptomic analyses. In the C-HPP initiative, the reference database for the annotation of human proteins is neXtProt.9 This database assigns experimental evidence to each human protein using a scale with five levels, from PE1 (experimental evidence at protein level) to PE5 (uncertain protein). The missing proteins are annotated as PE2 (experimental evidence at the transcript level), PE3 (protein inferred from homology) or PE4 (protein predicted). As a reference, the database version used in this study (release 01.09.2015) contained 20061 proteins, 16791 of them annotated as PE1 (83.70 % of protein entries). The number of missing proteins was 2680, corresponding to 13.36 % of the total entries in the database.

Several possibilities have been proposed to explain the difficulties in the detection of these proteins, including their low abundance, their tissue-specific expression and their stimulation-dependent or development-associated expression. In fact, the different methodological approaches applied to characterize missing proteins have confirmed that the selection of the tissue or cell type is critical to the success of these experiments.10-13 One of the most widely used methods for identifying the samples in which the probability of detecting missing proteins is higher takes into account the expression level of the corresponding transcripts. Therefore, the integration of genomics, transcriptomics and proteomics is widely used among HPP groups in order to design the experiments needed to improve the annotation of the human proteome.8 In particular, the Spanish Consortium of the HPP (spHPP), responsible for the study of chromosome 16, made a considerable effort to incorporate transcriptomic experiments as a tool for the analysis of the proteome. Public datasets from different resources such as the Gene Expression Omnibus (GEO) database14 and the ENCODE project15 were analyzed in depth to define the set of expressed genes in thousands of samples, including different biological sources (cell lines, normal tissues and cancer samples) and technologies (microarrays and RNA-Seq).16 In addition, a Bayesian classifier was developed to score the probability of expression of the missing proteins in more than 3400 microarray experiments.17 According to this study, testis, brain and skeletal muscle were the best tissue candidates to detect the highest number of missing proteins using shotgun proteomics.

However, even when the analyzed sample is enriched in missing proteins, their identification is still challenging, especially when the bioinformatics methods and the statistical thresholds required impose stringent criteria to ensure the reliability of the observations resulting from the automatic MS data analysis and sequence assignments. Basically, the MS evidence for a protein is considered valid when the following conditions are fulfilled: 1 % FDR at PSM, peptide and protein level, and more than one peptide detected (9 or more amino acids in length), at least two of which are not shared with other proteins of the reference database (proteotypic peptides). The recent analysis of the human spermatozoa proteome13 is a good example. In this study, those proteins with only one peptide identification were filtered using the set of unique peptides of the missing proteins obtained from the in silico digestion of the neXtProt database. The remaining PSMs were manually evaluated by three independent experts, allowing the assignment of 94 new missing proteins. Finally, the expression of C2orf57 and TEX37 was validated by immunohistochemistry. This excellent result allowed us to reach two important conclusions: the high accuracy of the available methods to predict the sample of interest based on public transcriptomics and proteomics experiments, and the need to develop new bioinformatic workflows and new methods of experimental validation able to circumvent the constraints inherent in the identification of the missing proteins.

In the field of proteomics, a huge number of shotgun experiments are publicly available in different data repositories.18 The most commonly used resources are the Global Proteome Machine Database (GPMDB, gpmdb.thegpm.org),19 PeptideAtlas (www.peptideatlas.org),20 the ProteomeXchange consortium (http://www.proteomexchange.org/)21 and the PRIDE database.22 More specifically, the members of the ProteomeXchange Consortium are working to standardise data submission and dissemination practices in the field. All proteomic experimental datasets in the HPP must be submitted to one of the ProteomeXchange resources. The stored data types include raw mass spectra data, peak lists, sample metadata and the results of the original analyses (identification and quantification of peptides and proteins). The PRIDE Archive database alone currently contains more than 5000 datasets, including more than 60000 assays.

In this manuscript, we used public MS experiments to obtain guidance in the search for missing proteins. Initially, we assessed the possibility of obtaining information about the samples in which the number of missing proteins is enriched using the PRIDE database. This approach confirmed the results obtained using transcriptome profiles and provided new biological sources to be explored. The experiments selected were downloaded from the database and studied using two data analysis workflows. The number of missing proteins identified by our bioinformatics workflow, based on the analysis of the intersection of the PSM FDR filtering of the experimental results with the proteotypic peptides obtained from the in silico analysis of the reference database (without FDR filtering at protein level) was higher than the number of missing proteins detected applying the HPP guidelines. Upon manual inspection and curation the best spectra assignments corresponding to chromosome 16 or detected in the HEK293 cell line were validated using SRM. Data are provided supporting the detection of DNAH3 in the spermatozoa sample. Moreover, ATAD3C and UNCX proteins, previously related to embryonic development, were also detected in the shotgun experiments and more interestingly, ATAD3C was confirmed by the LC-SRM experiments.

Material and methods

Analysis workflow

We applied an analysis approach based on the detection of proteotypic peptides in shotgun experiments using FDR filtering at the PSM level13 (Fig. 1), and the results obtained in terms of the number of missing proteins were compared with those resulting from the analysis recommended in the HPP Data Interpretation Guidelines version 2.0.1 (approved 2015-12-01). However, a major issue to be addressed beforehand was the selection of the samples to be analyzed in order to increase the chance of successful missing protein identifications. Different approaches had been previously described to select the biological source in which this probability is higher based on gene transcription profiles.8,17 We propose a new prediction which is based on publicly available MS/MS experiments. The PRIDE database was examined22 to obtain the set of experiments in which the number of peptide candidates from the missing proteins is higher (Fig. 1).

Data processing of PRIDE and neXtProt databases

This study was based on the data mining of public human datasets in the PRIDE Archive database (April 2015), which contained at the time 47409216 PSMs, distributed in 242 projects and 7295 assays. The database included 6001962 unique human peptides and 559405 different protein accession codes obtained using several search engines, including Mascot, Sequest, X!Tandem, OMSSA and Phenyx. Although we performed a complete proteome analysis of the samples selected for the study of the missing proteins, the selection of the proper experiments was carried out using only the human PSMs from the missing proteins of chromosome 16. We expected a certain proportionality between the number of peptides from the missing proteins detected in a shotgun experiment and the number of missing proteins present in the sample, although the information about the search engine and the statistical reliability of the identifications was not considered.

[Figure 1A, flowchart labels: neXtProt database (release 20150901); in silico protein digestion (trypsin); proteotypic peptide selection; PRIDE database; human PSMs; missing proteins; chromosome 16; selection of experiments; proteome; shotgun data analysis; PSM FDR < 1 %; protein FDR < 1 %; protein inference (conclusive); missing protein detection (proteotypic peptides); missing protein detection (HPP guidelines).]

[Figure 1B, values in extraction order: neXtProt database (release 20150901), in silico protein digestion (trypsin, 9-30 amino acids): # proteins = 20,028; # tryptic peptides = 1,787,406; # proteotypic peptides = 826,137; # proteins (>= 1 peptide) = 19,410. Shotgun data analysis, PSM FDR < 1 %: # missing proteins = 2,68; # proteins = 10,828; # peptides = 98,319; # proteins = 6,333; # peptides = 35,922; missing protein detection: # proteins (>= 1 peptide) = 122; # proteins (>= 2 peptides) = 39. Protein FDR < 1 %: # proteins = 8,712; # peptides = 93,015; protein inference (conclusive): # proteins = 5,626; # peptides = 64,477; missing protein detection: # proteins (>= 1 peptide) = 58; # proteins (>= 2 peptides) = 32.]

Figure 1: (A) Overall scheme of the analysis pipeline developed to identify missing proteins using the PRIDE database. (B) Summary of the numbers of proteins and peptides in each step of the analysis pipeline developed.

Proteogest software23 was used to perform the in silico digestion of all the proteins contained in the reference database (neXtProt release 20150901). We applied the standard rules of trypsin digestion and allowed oxidation of methionine and two missed cleavages. Processing the resulting set of tryptic peptides allowed us to find all the proteotypic peptides. In this manuscript, we use the theoretical definition of proteotypic peptide: a peptide generated after the digestion of a protein using a certain enzyme (commonly trypsin) that can only be detected in one protein, without taking into account experimental data or a bioinformatics prediction of MS detectability of the peptide.
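The theoretical definition above can be illustrated with a minimal Python sketch. It is not Proteogest and not the in-house scripts used in this work: the function names and the two toy sequences are hypothetical, the cleavage rule is assumed to be the usual one (after K or R, not before P), and methionine oxidation is ignored because it does not change the peptide sequence.

```python
from collections import defaultdict

def cleave(sequence):
    """Split a sequence after K or R unless the next residue is P (assumed rule)."""
    fragments, start = [], 0
    for i, residue in enumerate(sequence):
        next_residue = sequence[i + 1] if i + 1 < len(sequence) else ""
        if residue in "KR" and next_residue != "P":
            fragments.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        fragments.append(sequence[start:])
    return fragments

def tryptic_peptides(sequence, missed_cleavages=2, min_len=9, max_len=30):
    """All peptides with up to `missed_cleavages` missed sites and allowed length."""
    fragments = cleave(sequence)
    peptides = set()
    for i in range(len(fragments)):
        last = min(i + missed_cleavages, len(fragments) - 1)
        for j in range(i, last + 1):
            peptide = "".join(fragments[i:j + 1])
            if min_len <= len(peptide) <= max_len:
                peptides.add(peptide)
    return peptides

def proteotypic_peptides(proteins):
    """Map each peptide that occurs in exactly one protein to that protein."""
    owners = defaultdict(set)
    for accession, sequence in proteins.items():
        for peptide in tryptic_peptides(sequence):
            owners[peptide].add(accession)
    return {pep: next(iter(accs)) for pep, accs in owners.items() if len(accs) == 1}

# Hypothetical toy database: both entries share the peptide GLSDGEWQLVLK.
toy_db = {
    "P1": "ACDEFGHILMNKGLSDGEWQLVLK",
    "P2": "TTWYVSQPAGHRGLSDGEWQLVLK",
}
toy_proteotypic = proteotypic_peptides(toy_db)
print(sorted(toy_proteotypic.items()))
```

In the toy database the shared peptide is discarded, while the peptides that contain a protein-specific stretch are kept as proteotypic.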

Shotgun data analysis using HPP guidelines

The selected datasets were analyzed for protein identification following the HPP guidelines. We searched all the mgf files downloaded from PRIDE against the neXtProt database (release 20150901) using the target-decoy strategy with an in-house Mascot Server v. 2.3 (Matrix Science, London, U.K.) search engine. A decoy database was created using the peptide pseudo-reversed method and separate searches were performed for target and decoy databases.

For each sample, search parameters were fixed on the basis of the information provided in the metadata associated with the project in PRIDE or by the methods described in the referenced article. False discovery rates at PSM level and protein level were calculated using Mayu24 and protein identifications were obtained applying the criteria of PSM FDR < 1 % and protein FDR < 1 %. Protein inference was performed using the PAnalyzer algorithm.25 Only those missing proteins labeled as conclusive by this algorithm and with at least 2 proteotypic peptides were considered as observed missing proteins in the sample.
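As an illustration of how a 1 % FDR criterion can be derived from separate target and decoy searches, the following Python sketch picks the lowest score cut-off at which the ratio of decoy to target matches does not exceed the limit. It is a simplified sketch and does not reproduce Mayu or Mascot; the function name and the toy scores are hypothetical.

```python
def score_threshold(target_scores, decoy_scores, max_fdr=0.01):
    """Lowest score cut-off with estimated FDR (decoys / targets) <= max_fdr.

    Every distinct target score is tried as a cut-off, so the result is the
    most permissive cut-off that still satisfies the limit. Returns None when
    no cut-off satisfies it.
    """
    best = None
    for cutoff in sorted(set(target_scores), reverse=True):
        n_target = sum(score >= cutoff for score in target_scores)
        n_decoy = sum(score >= cutoff for score in decoy_scores)
        if n_decoy / n_target <= max_fdr:
            best = cutoff
    return best

# Hypothetical ion scores from a target search and a decoy search.
toy_targets = list(range(1, 201))
toy_decoys = [5, 10, 50]
print(score_threshold(toy_targets, toy_decoys))
```

The quadratic scan keeps the sketch short; sorting both score lists once would be the natural choice for real search results.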

Detection of proteotypic peptides in shotgun experiments

We propose an alternative analysis of the proteomics experiments to increase the number of missing proteins detected without a significant loss of the quality of the results (Fig. 1). This pipeline used the PSMs with PSM FDR < 1 %, and the peptides identified using this criterion were intersected with the set of proteotypic peptides obtained after the in silico digestion of all the amino acid sequences of the neXtProt database. This approach ensured that the proteins obtained had at least one peptide capable of discriminating them from the rest of the proteins in the reference database. Finally, the spectra assignments of the peptides potentially corresponding to missing proteins were manually curated to select the best candidates. Further verification by SRM was conducted in the indicated matrices. Nevertheless, an estimation of the protein FDR value was obtained by processing the results against the decoy database in a similar way. We performed the in silico digestion of the decoy database and extracted the proteotypic peptides. We used the minimum Mascot ion score of the target proteotypic peptides with PSM FDR < 1 % to estimate the number of false protein identifications using the decoy proteotypic peptides with a higher score. The FDR at protein level was calculated as the ratio between the number of decoy proteins and the number of target proteins detected.

Sample collection and preparation

Sperm samples (more than 30 million cells) and HEK293 cells were centrifuged at 800 g for 10 minutes. The supernatant of sperm samples (seminal plasma) was removed and saved in a cryotube. The cellular pellet was washed twice with 1.5 ml of PBS, frozen in liquid nitrogen and stored at -20 °C until used. The pelleted cells were thawed and disrupted by addition of lysis buffer (8 M urea, 2 M thiourea and 4 % CHAPS) and vigorous agitation in a vortex for 30 min at room temperature. Cell debris was removed by centrifugation at 24100 g for 10 minutes. The supernatants were stored at -20 °C until used. The protein concentration of the supernatant was determined using the Bio-Rad RC DC Protein Assay Kit (#500-0122).

Targeted proteomic analyses (SRM/MRM)

Total cell extracts were loaded into a 1D SDS-PAGE gel and run until the sample just entered the resolving gel. Gels were fixed (50 % methanol / 10 % acetic acid), stained with Coomassie (Simply Blue Safe Stain, Invitrogen), washed to reveal the unique band containing the whole proteome and subjected to in-gel trypsin digestion. Briefly, the gel section was destained twice with AcN for 5 minutes at 40 °C, removing the liquid to complete dryness of the gel. Proteins were reduced and alkylated with 10 mM DTT / 100 mM ammonium bicarbonate and 28 mM iodoacetamide / 100 mM ammonium bicarbonate respectively for 10 minutes at 40 °C. Subsequently, gel pieces were dried with AcN for 5 minutes at 40 °C, removing the supernatant to complete dryness. Proteins were digested with trypsin (Promega) using a 1:20 trypsin/protein ratio overnight at 37 °C. Peptide extraction was performed with consecutive incubations (30 minutes, room temperature) with: 1 % formic acid / 2 % AcN; 0.5 % formic acid / 50 % AcN; 100 % AcN. All supernatants were combined and evaporated to dryness in a speed-vac. Peptides were solubilized in 1 % trifluoroacetic acid and further extracted using a C18 reverse phase sorbent (Pierce C18 Spin Tips) following the manufacturer's protocol. Extracted peptides were dried in a speed-vac before nLC ESI-MS/MS analysis.

A total of 23 proteotypic peptides were selected and isotopically labeled standards were synthesized. Peptide standards were prepared at 500, 125, 25 and 5 fmol/μl in 2 % acetonitrile, 0.1 % FA. Two microliters of the solutions were analyzed in a Qtrap5500 (ABSciex) coupled to a nanoflow HPLC (Eksigent) equipped with a nanoelectrospray ion source. Mobile phases were A (100 % H2O and 0.1 % formic acid) and B (100 % AcN and 0.1 % formic acid). Peptides were separated by C18 reverse phase chromatography at a flow rate of 0.3 μl/min in an Acclaim Peptide Map RSLC 75 μm (column ID) x 150 mm (column length) x 2 μm (particle size) analytical column, using the gradient: 0 min, 3 % B; 3 min, 3 % B; 90 min, 40 % B; 100 min, 50 % B; 102 min, 90 % B; 108 min, 90 % B; 110 min, 3 % B; 125 min, 3 % B. Electrospray parameters used were: CUR=20; CAD=high; IS=2800; GS1=20; GS2=0 and IHT=150. The collision energy and declustering potential applied to each peptide were calculated with the Skyline software. Dwell time for each transition was 20 ms for the synthetic heavy peptides and 100 ms for the endogenous peptides.

The raw MS proteomics data have been deposited in PeptideAtlas20 PASSEL with accession code PASS00925.

Results and discussion

Sample selection based on the PRIDE database content

We found 601 proteotypic peptide candidates from neXtProt (release 20150101) in 65 PRIDE projects, which suggests the presence of 102 missing proteins of chromosome 16 with 2630 PSMs. The number of detected peptides in each project is shown in Fig. 2. This barplot was used to select the project accession codes in which the expected number of missing proteins of chromosome 16 was higher (at least 50 peptides associated with missing proteins).
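The selection rule just described (count the distinct candidate peptides per project and keep the projects that reach a cut-off) can be sketched in Python as follows. This is an illustrative sketch only: the function name, the record layout and the toy records are hypothetical, and the default cut-off of 50 peptides is the one quoted in the text.

```python
from collections import defaultdict

def select_projects(psms, min_peptides=50):
    """Return {project: number of distinct peptides} for projects at or above the cut-off.

    psms: iterable of (project_accession, peptide_sequence) pairs, already
    restricted to candidate peptides of missing proteins.
    """
    peptides_by_project = defaultdict(set)
    for project, peptide in psms:
        peptides_by_project[project].add(peptide)
    counts = {project: len(peps) for project, peps in peptides_by_project.items()}
    ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    return {project: n for project, n in ranked if n >= min_peptides}

# Hypothetical toy records; a cut-off of 2 is used only to keep the example small.
toy_psms = [
    ("PROJ_A", "PEPTIDEONEK"),
    ("PROJ_A", "PEPTIDETWOK"),
    ("PROJ_A", "PEPTIDEONEK"),  # repeated PSM of the same peptide counts once
    ("PROJ_B", "PEPTIDEONEK"),
]
print(select_projects(toy_psms, min_peptides=2))
```

Counting distinct peptides rather than PSMs avoids favouring projects in which a single peptide was matched many times.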

However, the PRIDE database is constantly changing, incorporating experiments as new proteomic datasets are submitted. We tried to consider this dynamic behavior as far as possible and included new samples in the study during the development of the project. Consequently, we included 4 samples from rare biological sources, since it had been shown that these samples can be used to detect missing proteins:13 spermatozoid,13 seminal plasma,26 retina27 and placenta.28 In addition, we included a more recent proteome characterization of the HEK293 cell line29 in place of the experiment with PRIDE accession number PXD001383. The list of projects selected from the PRIDE database for analysis is shown in Table 1.

[Figure 2, bar plot. x axis: PRIDE project accession. Highlighted projects: PXD001383 (163 peptides), HEK-293T cell line; PRD000073 (100 peptides), blood plasma; PXD000004 (77 peptides), frontal cortex; PXD000605 (77 peptides), blood plasma; PRD000269 (50 peptides), aorta thoracica.]

Figure 2: Number of proteotypic peptides of chromosome 16 missing proteins in the neXtProt database that were detected in the shotgun MS/MS experiments stored in the PRIDE database. The experiments selected for further analyses are highlighted in red.

In silico analysis of the neXtProt database

The total number of peptides obtained was 7031853 (2958508 unique peptides) and 8.81 % of the unique peptides corresponded to missing proteins. The mean number of peptides per protein for the missing proteins was 116, whereas the mean number of peptides for the non-missing proteins was 180. This was in accordance with a previous analysis of the features of the missing proteins17 in which it is shown that these proteins are shorter. The set of proteotypic peptides (tryptic peptides not shared among proteins of the neXtProt database) was generated using in-house scripts. The number of proteotypic peptides ranging from 9 to 30 amino acids in length was 826137, 10.59 % of which were assigned to missing proteins (87545 peptides).

Table 1: Project accessions of the PRIDE database selected for the identification of missing proteins. The number of samples and fractions analyzed in this study are shown.

Project Accession   Tissue            Instrument           # samples   # fractions
PXD001468           HEK293            Q Exactive           1           24
PXD002367           Spermatozoid      LTQ Orbitrap         1           21
PXD001242           Retina            LTQ Orbitrap Elite   5           60
PXD000754           Placenta          LTQ Orbitrap         2           47
PXD000605           Blood plasma      LTQ Orbitrap         3           146
PXD000004           Frontal cortex    Q Exactive           5           14
PRD000269           Aorta thoracica   LTQ Orbitrap         1           108
PXD002145           Seminal plasma    LTQ Orbitrap Elite   2           96

The number of tryptic and proteotypic peptides discovered using the amino acid sequences of the neXtProt database for each chromosome is shown in Fig. 3A and Fig. 3B respectively. The mean number of proteotypic peptides per chromosome was 3498 for the missing proteins and 29552 for the nonmissing proteins. The number of proteins that contained at least one tryptic peptide with a length between 9 and 30 amino acids was 20028. Interestingly, 19410 proteins, almost all of the proteins detectable with tryptic peptides, also had at least one proteotypic peptide. The number of missing proteins that could be detected by at least one proteotypic peptide was 2533, 94.94 % of the missing proteins in the neXtProt database. There were 2496 with two or more proteotypic peptides, 37 with only one and 135 without any predictable tryptic or proteotypic peptide, which will not be detectable according to the HPP guidelines using trypsin (Supporting Information Table 1). For these 95 proteins, other experimental approaches must be developed, for example the use of other enzymes for protein digestion.
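The coverage figure quoted above follows from the three counts given in the same paragraph, as the short Python check below shows. The function name is hypothetical and the check only reproduces the arithmetic; the total it implies (2496 + 37 + 135 = 2668) is taken from these counts and not from any other part of the text.

```python
def coverage_percent(n_two_or_more, n_exactly_one, n_none):
    """Percentage of proteins with at least one proteotypic peptide."""
    detectable = n_two_or_more + n_exactly_one
    total = detectable + n_none
    return detectable, total, 100 * detectable / total

# Counts quoted in the text for the missing proteins.
detectable, total, percent = coverage_percent(2496, 37, 135)
print(detectable, total, round(percent, 2))
```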

In Fig. 3C and Fig. 3D we represent the distribution of these proteins across chromosomes. The mean number of proteins with at least one tryptic peptide per chromosome was 801 and that with at least one proteotypic peptide was 777. In the case of the missing proteins, the average number of proteins per chromosome with at least one proteotypic peptide was reduced to 101 proteins.

[Figure 3, panels A-D: bar plots per chromosome (1-22, X, Y, MT). Legend: nonmissing proteins, missing proteins.]

Figure 3: (A) Distribution of tryptic peptides deduced from in silico digestion of the neXtProt database (release 20150901) along chromosomes. (B) Distribution of proteotypic peptides deduced from the in silico digestion of the neXtProt database (release 20150901) along chromosomes. (C) Distribution of proteins with at least one tryptic peptide after the in silico digestion of the neXtProt database (release 20150901) along chromosomes. (D) Distribution of proteins with at least one proteotypic peptide after the in silico digestion of the neXtProt database (release 20150901) along chromosomes.

With regard to chromosome 16, there are 836 proteins with at least one tryptic peptide and 813 with at least one proteotypic peptide with a length between 9 and 30 amino acids. 11.12 % of tryptic proteins and 11.19 % of proteotypic proteins are still considered missing proteins (93 tryptic and 91 proteotypic proteins).

Identification of conclusive missing proteins

In order to perform the analysis of the missing proteins for all the chromosomes, we used more than 5 million spectra that were available in the selected projects from the PRIDE database. After the independent analysis of each of the experiments downloaded from the PRIDE database following the HPP guidelines, we assigned 503054 of these spectra (9.77 %) and we identified 5284 proteins with 1 or more proteotypic peptides and 3950 proteins with 2 or more proteotypic peptides. We detected 58 missing proteins with 1 or more proteotypic peptides and 32 proteins with 2 or more proteotypic peptides (Supporting Information Table 3).

The results from each sample analysis are summarized in Table 3. Spermatozoid (PXD002367) and the HEK293 cell line (PXD001468) were the samples with the highest number of missing proteins detected. This result was consistent with previous analyses of the spermatozoid proteome13 and it revealed the HEK293 cell line as a new biological source of missing proteins. However, we did not find any evidence of the presence of missing proteins in placenta (PXD000754), blood plasma (PXD000605), frontal cortex (PXD000004), aorta thoracica (PRD000269) or seminal plasma (PXD002145) samples.


Table 2: Parameters used in the Mascot search engine for the analysis of each downloaded project from the PRIDE database.

Project     Precursor mass    Fragment mass    Missed      Fixed                               Variable
Accession   tolerance (ppm)   tolerance (Da)   cleavages   modifications                       modifications
PXD001468   20                0.05             2           Carbamidomethyl (C)                 Oxidation (M)
PXD002367   10                0.5              2           Carbamidomethyl (C)                 Oxidation (M), Acetyl (Protein N-term)
PXD001242   20                0.05             2           Carbamidomethyl (C)                 Oxidation (M)
PXD000754   20                1                2           Carbamidomethyl (C)                 Oxidation (M)
PXD000605   20                0.05             2           iTRAQ4plex114 (K), Methylthio (C)   iTRAQ4plex114 (Y), Oxidation (M)
PXD000004   20                0.05             2           Carbamidomethyl (C)                 Oxidation (M), Label: 13C(6) (K)
PRD000269   20                0.05             2           Carbamidomethyl (C)                 Oxidation (M)
PXD002145   10                0.5              2           Carbamidomethyl (C)                 Oxidation (M), Acetyl (Protein N-term)

Table 3: Number of PSMs, peptides and proteins identified using the HPP guidelines (PSM FDR < 1 %, protein FDR < 1 %) in the samples selected from PRIDE for the analysis of the missing proteins. FP: False positives.

                                      PXD001468  PXD002367  PXD001242  PXD000754  PXD000605  PXD000004  PRD000269  PXD002145  Total
Spectra                               836145     114970     452880     519326     1299378    357899     370218     1198042    5148858
Total PSMs                            328554     48609      110624     80213      19086      136506     21969      6676       752237
FP PSMs                               161        34         136        201        5          154        11         116        818
Total Peptides                        68377      9848       14413      10122      1228       16679      2001       199        93012
Total Peptides (proteotypic)          24510      3990       5393       4226       788        5737       746        56         33756
Total Peptides (non-proteotypic)      43867      5858       9020       5896       440        10942      1255       143        59256
FP Peptides                           70         12         20         46         2          41         3          8          202
Total Proteins                        7206       1437       2681       2127       363        2340       351        54         8712
Total Conclusive Prot                 4539       909        1501       1140       146        1069       193        29         5626
FP Proteins                           33         8          15         11         2          33         1          8          111
Total Assigned Spectra                191095     24736      66707      51392      18091      133602     14936      2495       503054
Missing PSMs                          798        473        117        0          0          0          0          0          1388
Missing Peptides                      83         258        25         0          0          0          0          0          357
Missing Proteins                      10         47         5          0          0          0          0          0          60
Missing Assigned Spectra              479        367        68         0          0          0          0          0          914
Total Proteins HPP (≥ 1 peptide)      4276       888        1450       1115       146        1053       188        28         5284
Total Proteins HPP (≥ 2 peptides)     3326       750        1260       1000       120        924        169        22         3950
Missing Proteins HPP (≥ 1 peptide)    10         45         5          0          0          0          0          0          58
Missing Proteins HPP (≥ 2 peptides)   5          27         1          0          0          0          0          0          32

Detection of missing proteins using proteotypic peptides

Our objective is to increase the number of missing protein detections in the human proteome using the selected PRIDE datasets with an alternative bioinformatics pipeline based on the identification of proteotypic peptides deduced from the proteins of interest. In this strategy, we retained the protein identifications that failed to pass the 1 % FDR criterion at the protein level. The PSMs obtained with the Mascot search engine (search parameters shown in Table 2) with PSM FDR < 1 % were used to identify all potential tryptic peptides from the proteins present in the samples (Supporting Information Table 2). Finally, this set of peptides was intersected with the proteotypic peptides found after the in silico digestion of the neXtProt database.
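The intersection step can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes a simple trypsin rule (cleavage after K or R unless followed by P, no missed cleavages) and a minimum peptide length of 7, and the sequences, accessions and function names are all hypothetical. The settings actually used for the in silico digestion are not reproduced here.

```python
import re
from collections import defaultdict

def tryptic_peptides(sequence, min_len=7):
    """Cleave after K/R unless followed by P (no missed cleavages)."""
    pieces = re.split(r"(?<=[KR])(?!P)", sequence)
    return {p for p in pieces if len(p) >= min_len}

def proteotypic_peptides(proteins, min_len=7):
    """Map each peptide that occurs in exactly one protein to that protein."""
    owners = defaultdict(set)
    for accession, sequence in proteins.items():
        for peptide in tryptic_peptides(sequence, min_len):
            owners[peptide].add(accession)
    return {pep: next(iter(acc)) for pep, acc in owners.items() if len(acc) == 1}

# Toy sequences (hypothetical, not real neXtProt entries).
proteins = {
    "P1": "MAAAAAAKSHAREDPEPTIDEK",
    "P2": "MCCCCCCKSHAREDPEPTIDEK",
}
proteotypic = proteotypic_peptides(proteins)

# Peptides assumed to come from PSMs that passed the PSM FDR < 1 % filter.
observed = {"MAAAAAAK", "EDPEPTIDEK"}
detected = {proteotypic[p] for p in observed if p in proteotypic}
print(detected)
```

Only the peptide unique to P1 supports a detection here; the shared peptide is ignored, which is why no protein inference step is needed.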


This approach allowed us to detect a total of 6333 proteins, 1049 more than the proteins identified with the HPP guideline analysis. With regard to the number of peptides identified, we obtained 35922 proteotypic peptides with PSM FDR < 1 %, representing an increase of 6.42 % over the peptides detected with the previous method. In order to achieve these results, 515506 spectra were assigned, a slight increase (0.24 %) in the percentage of spectra used from the total number of spectra available in the datasets. This led to the inclusion of 12452 new spectra in the analysis (Table 4).
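The reported gains follow directly from the totals in Tables 3 and 4. The short check below reproduces them; the dictionary and variable names are illustrative only.

```python
# Totals taken from Table 3 (HPP guidelines) and Table 4 (proteotypic peptides).
hpp = {"proteins": 5284, "proteotypic_peptides": 33756, "assigned_spectra": 503054}
alt = {"proteins": 6333, "proteotypic_peptides": 35922, "assigned_spectra": 515506}
total_spectra = 5148858

extra_proteins = alt["proteins"] - hpp["proteins"]
peptide_gain_pct = (
    100 * (alt["proteotypic_peptides"] - hpp["proteotypic_peptides"])
    / hpp["proteotypic_peptides"]
)
extra_spectra = alt["assigned_spectra"] - hpp["assigned_spectra"]
# Gain expressed relative to all spectra available in the datasets.
spectra_gain_pct = 100 * extra_spectra / total_spectra

print(extra_proteins, round(peptide_gain_pct, 2), extra_spectra, round(spectra_gain_pct, 2))
```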

The mean value of the FDR estimation at protein level was 8 %. This value is higher than the threshold recommended by the HPP guidelines, but provided high-quality results after a manual curation of the assigned spectra.
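This section does not spell out how the protein-level FDR was estimated. Assuming a conventional target-decoy estimate (decoy hits divided by target hits), the calculation could look like the sketch below. The function name and the counts are hypothetical and were chosen only to reproduce an 8 % value.

```python
def target_decoy_fdr(n_target: int, n_decoy: int) -> float:
    """Estimate the FDR as decoy hits over target hits (simple target-decoy ratio)."""
    if n_target == 0:
        return 0.0
    return n_decoy / n_target

# Hypothetical counts, not taken from the study.
fdr = target_decoy_fdr(n_target=1000, n_decoy=80)
print(f"{fdr:.1%}")
```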

Focusing on missing proteins, 122 were potentially identified (Supporting Information Table 4), 62 more than those detected as conclusive proteins by PAnalyzer, and only 242 peptides were needed compared with the 357 peptides obtained after the protein inference process. Seminal plasma (PXD002145) and blood plasma (PXD000605) were the only samples where we did not find any evidence of missing proteins. We also observed differences in the number of spectra assigned: 320 in this analysis and 914 in the previous one. This result is consistent with the basis of our method, since it only allows for proteotypic peptide detection.

Table 4: Number of PSMs, peptides and proteins observed using the identifications of proteotypic peptides from neXtProt database (PSM FDR < 1 %) in the samples selected from PRIDE for the analysis of the missing proteins.

| | PXD001468 | PXD002367 | PXD001242 | PXD000754 | PXD000605 | PXD000004 | PRD000269 | PXD002145 | Total |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Spectra | 836145 | 114970 | 452880 | 519326 | 1299378 | 357899 | 370218 | 1198042 | 5148858 |
| Total PSMs | 332417 | 49100 | 115861 | 82856 | 19182 | 138704 | 23199 | 6676 | 767995 |
| Total Peptides | 71277 | 10311 | 16329 | 11739 | 1271 | 17570 | 2521 | 199 | 98319 |
| Total Peptides (proteotypic) | 25734 | 4187 | 6259 | 4848 | 804 | 6092 | 988 | 56 | 35922 |
| Total Peptides (non-proteotypic) | 45543 | 6124 | 10070 | 6891 | 467 | 11478 | 1533 | 143 | 62397 |
| Total Proteins (≥ 1 peptide) | 5341 | 1293 | 2420 | 2208 | 245 | 1929 | 569 | 41 | 6333 |
| Total Proteins (≥ 2 peptides) | 3326 | 750 | 1260 | 1000 | 120 | 924 | 169 | 22 | 3950 |
| Total Assigned Spectra | 193971 | 25083 | 71118 | 53398 | 18187 | 135285 | 15969 | 2495 | 515506 |
| Missing PSMs | 96 | 246 | 33 | 22 | 0 | 14 | 4 | 0 | 415 |
| Missing Peptides | 48 | 163 | 14 | 10 | 0 | 8 | 4 | 0 | 242 |
| Missing Proteins (≥ 1 peptide) | 30 | 67 | 14 | 10 | 0 | 8 | 4 | 0 | 122 |
| Missing Proteins (≥ 2 peptides) | 8 | 30 | 3 | 2 | 0 | 2 | 2 | 0 | 39 |
| Missing Assigned Spectra | 62 | 195 | 29 | 16 | 0 | 14 | 4 | 0 | 320 |

The peptide distribution along chromosomes showed that the proteotypic peptides were a small fraction of the total number of peptides observed (Fig. 4A). Moreover, the Mascot ion scores of the peptides from the missing proteins were significantly lower than those of the peptides from the nonmissing proteins (t-test with a p-value < 1E-16, see Fig. 4B). Unsurprisingly, the proportion of missing proteins detected per chromosome was very small, although assignments were made in all chromosomes with the sole exception of the mitochondrial chromosome (Fig. 4C). As might be expected, the comparison of the missing proteins detected by the two bioinformatic pipelines (Fig. 4D) showed that the majority of proteins detected using the HPP guidelines were included in the set of missing proteins with proteotypic peptides.
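The comparison of ion scores between the two groups can be illustrated with a two-sample t statistic. The text does not state which t-test variant was applied, so the sketch below assumes Welch's unequal-variance form and uses only the standard library. The scores are invented for illustration, and the p-value (which requires the t distribution) is not computed.

```python
from math import sqrt
from statistics import mean, variance

def welch_t(a, b):
    """Welch's t statistic and approximate degrees of freedom for two samples."""
    va, vb = variance(a) / len(a), variance(b) / len(b)
    t = (mean(a) - mean(b)) / sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df

# Hypothetical Mascot ion scores, not the study data.
missing = [31.4, 40.6, 43.6, 39.2, 45.0, 36.1]
nonmissing = [62.3, 75.1, 58.8, 80.2, 66.4, 71.9]
t, df = welch_t(missing, nonmissing)
print(t, df)
```

A negative statistic indicates lower scores in the first group, which is the direction reported for the peptides from missing proteins.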

Figure 4: (A) Distribution of tryptic and proteotypic peptide candidates detected in the analyzed samples along the different chromosomes. (B) Boxplot with the distribution of Mascot ion scores obtained for the PSMs assigned to missing and nonmissing proteins. The difference between these distributions is statistically significant with a p-value < 1E-12. (C) Distribution of missing and nonmissing proteins potentially detected in the analyzed samples using the identification of proteotypic peptides along chromosomes. (D) Venn diagram with the missing proteins observed using the HPP guidelines and the workflow proposed here and with the missing proteins in neXtProt database release 20150901.

In spite of the tissue specificity of the missing proteins, we found a few examples of proteins shared between samples (Fig. 5A). The diagonal of the heatmap represents the number of missing proteins identified in each sample, and the rest of the matrix is filled with the number of missing proteins common to each pair of samples. The majority of the protein identifications were sample specific. To visualize the results more effectively, we used a network to represent the relations between the samples studied, the missing proteins observed and the peptides detected (Fig. 5B). This graph could be completed with the results of the analysis of more proteomic datasets in order to generate the network of the missing proteins of the human proteome.
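The heatmap matrix described for Fig. 5A can be built from per-sample protein sets: each diagonal cell is the size of a set and each off-diagonal cell is the size of an intersection. The sketch below uses a small toy subset of gene names for illustration; it is not the full data behind the figure.

```python
# Toy per-sample sets of missing proteins (illustrative subset only).
samples = {
    "HEK293": {"UNCX", "KIF4B", "DNAH12", "ATAD3C"},
    "Spermatozoa": {"DNAH3", "TEPP", "DNAH12"},
    "Retina": {"ATAD3C", "DNAH3"},
}

names = list(samples)
# Cell (i, j) counts the proteins shared by samples i and j;
# on the diagonal this is simply the number of proteins in sample i.
matrix = [[len(samples[a] & samples[b]) for b in names] for a in names]

for name, row in zip(names, matrix):
    print(name, row)
```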

Missing proteins identified in the HEK293 cell line or from chromosome 16

The network of the missing proteins was used to extract information about the proteins observed, to perform functional analysis of protein sets and to select peptides for validation. In our case, we decided to continue the analysis with the missing proteins detected in the HEK293 cell line (PXD001468, shown in Fig. 5C), as this sample differs from those of previous studies focused on testis and sperm, and with those encoded by chromosome 16 genes, as this is the chromosome adopted by the Spanish team (Fig. 5D). Table 5 shows the 34 proteins, corresponding to 33 known genes, found in the HEK293 cell line or encoded in chromosome 16. These proteins are a subset of the proteins obtained using the detection of proteotypic peptides (Supporting Information Table 4).

Figure 5: (A) Heatmap with the missing proteins potentially detected in each sample and the missing proteins shared between each pair of samples analyzed. (B) Network representation of the results obtained for the study of the missing proteins using the PRIDE database. Nodes represent the database of experiments used (green), the tissue (orange), the proteins observed (red) and the identified peptides (blue). (C) Network for the missing proteins potentially observed in the HEK293 sample. Nodes represent the sample selected (green), the chromosome (blue) and the identified protein (red). (D) Network for the missing proteins potentially detected in chromosome 16. Nodes represent the sample (orange), the proteins observed (red) and the identified peptides (blue).

Table 5: Missing proteins potentially identified using proteotypic peptide candidates in the HEK293 cell line or in chromosome 16.

| Protein | Name | Chr | #PSMs | #Peptides | Ion score | HPP guidelines (2 proteotypic peptides) | Sample |
| --- | --- | --- | --- | --- | --- | --- | --- |
| NX_A6NJT0 | UNCX | 7 | 8 | 4 | 113.22 | | HEK |
| NX_B2RXH8 | HNRNPCL2 | 1 | 276 | 15 | 102.61 | | HEK, Retina |
| NX_Q9BQ87 | TBL1Y | Y | 76 | 10 | 100.66 | | HEK |
| NX_Q2VIQ3 | KIF4B | 5 | 46 | 19 | 99.77 | ✓ | HEK |
| NX_Q6IS14 | EIF5AL1 | 10 | 298 | 17 | 95.03 | ✓ | HEK |
| NX_Q5T2N8 | ATAD3C | 1 | 56 | 8 | 85.06 | ✓ | HEK, Retina |
| NX_Q56UQ5 | - | X | 55 | 4 | 81.79 | ✓ | HEK |
| NX_Q8TD57 | DNAH3 | 16 | 27 | 25 | 80.77 | ✓ | Spermatozoa, Retina |
| NX_Q6URK8 | TEPP | 16 | 17 | 10 | 79.62 | | Spermatozoa |
| NX_Q9NRJ5 | PAPOLB | 7 | 10 | 3 | 77.63 | ✓ | HEK |
| NX_Q6ZR08 | DNAH12 | 3 | 34 | 23 | 75.04 | ✓ | Placenta, HEK, Spermatozoa |
| NX_A8K0S8 | MEIS3P2 | 17 | 6 | 1 | 58.75 | ✓ | HEK |
| NX_Q6ZMV8 | ZNF730 | 19 | 3 | 3 | 58.21 | ✓ | HEK |
| NX_Q14585 | ZNF345 | 19 | 1 | 1 | 57.3 | ✓ | HEK |
| NX_Q52M93 | ZNF585B | 19 | 1 | 1 | 54.08 | ✓ | HEK |
| NX_Q9UJN7 | ZNF391 | 6 | 4 | 3 | 53.79 | ✓ | HEK |
| NX_P58180 | OR4D2 | 17 | 3 | 1 | 52.28 | ✓ | HEK, Spermatozoa |
| NX_Q8NGL6 | OR4A15 | 11 | 3 | 1 | 52.28 | | Spermatozoa, HEK |
| NX_P59817 | ZNF280A | 22 | 1 | 1 | 48.17 | | HEK |
| NX_A6NHN6 | NPIPB15 | 16 | 7 | 5 | 47.7 | ✓ | Spermatozoa |
| NX_Q9Y2H8 | ZNF510 | 9 | 1 | 1 | 45.02 | | HEK |
| NX_Q96KX1 | C4orf36 | 4 | 1 | 1 | 44.7 | ✓ | HEK |
| NX_Q96M86 | DNHD1 | 11 | 1 | 1 | 44.16 | | HEK |
| NX_Q5VTU8 | ATP5EP2 | 13 | 1 | 1 | 43.65 | ✓ | HEK |
| NX_Q8N0W5 | IQCK | 16 | 1 | 1 | 43.57 | ✓ | Spermatozoa |
| NX_Q4AC99 | ACCSL | 11 | 1 | 1 | 43.57 | ✓ | HEK |
| NX_A6NNF4 | ZNF726 | 19 | 2 | 1 | 43.39 | ✓ | HEK |
| NX_P0CW27 | CCDC166 | 8 | 1 | 1 | 40.88 | ✓ | HEK |
| NX_A6NCM1 | IQCA1L | 7 | 1 | 1 | 40.63 | ✓ | HEK |
| NX_Q8NDH2 | CCDC168 | 13 | 1 | 1 | 40.58 | ✓ | HEK |
| NX_Q6R2W3 | ZBED9 | 6 | 1 | 1 | 40.51 | ✓ | HEK |
| NX_A6NN73 | GOLGA8CP | 15 | 1 | 1 | 40.35 | ✓ | HEK |
| NX_Q9H2H0 | CXXC4 | 4 | 1 | 1 | 39.19 | ✓ | HEK |
| NX_Q9BXX2 | ANKRD30B | 18 | 3 | 2 | 39.01 | ✓ | Aorta, HEK |

Functional analysis of the missing proteins

The functional analysis of the list of the 182 missing proteins detected was performed, and a good correlation between the results obtained and the sample types analyzed was found. We used DAVID v6.7 software³⁰ for the analysis of GO terms, INTERPRO domains, KEGG pathways, PANTHER pathways and UNIGENE quantile expression level gene sets, using the whole human proteome as the background list of proteins. The statistical analysis was performed using default parameters and, although the p-value was corrected using multiple hypothesis testing methods (including FDR), the selection of enriched categories was based on a criterion of EASE Score < 0.1, as suggested by the bioinformatics tool. Using these recommended settings, we found a list of enriched categories related to specific functions carried out by these proteins in the samples analyzed (Supporting Information Table 5). First, the tissue-specific expression analysis for these genes using the "UNIGENE_EST_QUARTILE" expression profile database showed statistical enrichment in "brain normal" with 44 genes and a p-value = 0.003, "embryo development" with 51 genes and a p-value = 0.01 and "testis normal" with 67 genes and a p-value = 2.73E-14, confirming the sample specificity of the missing proteins detected. The results of the enrichment analysis of GO terms showed categories previously related to spermatozoa function,¹³ such as "microtubule-based movement", "sexual reproduction", "integral to membrane" or "motor activity". Other enriched categories were related to brain tissues or neurological processes, such as "sensory perception", "neurological system process", "cognition" or "postsynaptic membrane". Finally, others were involved in cell differentiation ("transcription" or "DNA binding"). We also compared the categories obtained with those previously defined with a similar functional analysis of all the missing proteins,¹⁷ and many overlaps were found: "G-protein coupled receptor protein signaling pathway", "integral to membrane", "olfactory receptor activity", or some Interpro domains ("zinc finger, C2H2-type", "GPCR, rhodopsin-like superfamily").
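The EASE Score used by DAVID as the selection criterion is a conservative variant of the one-sided Fisher exact test in which one hit is removed from the gene list before the p-value is computed. The sketch below approximates it with a hypergeometric tail. This is a simplified formulation whose contingency table may differ slightly from the one DAVID uses, and the counts and function names are hypothetical.

```python
from math import comb

def hypergeom_tail(k, n_list, n_pop_hits, n_pop):
    """P(X >= k) for X ~ Hypergeometric(n_pop, n_pop_hits, n_list)."""
    upper = min(n_list, n_pop_hits)
    total = comb(n_pop, n_list)
    favourable = sum(
        comb(n_pop_hits, i) * comb(n_pop - n_pop_hits, n_list - i)
        for i in range(max(k, 0), upper + 1)
    )
    return favourable / total

def ease_score(list_hits, n_list, n_pop_hits, n_pop):
    """Fisher-style tail probability after removing one hit from the list."""
    return hypergeom_tail(list_hits - 1, n_list, n_pop_hits, n_pop)

# Hypothetical counts: 3 of 300 list genes fall in a category that
# contains 40 of the 30000 background genes.
fisher = hypergeom_tail(3, 300, 40, 30000)
ease = ease_score(3, 300, 40, 30000)
print(fisher, ease)
```

With these numbers the unmodified tail probability is below 0.01, whereas the EASE variant is roughly 0.06, which still passes the EASE Score < 0.1 criterion but shows how the correction penalizes categories supported by very few genes.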

A complementary functional and pathway analysis of this protein set was carried out using QIAGEN Ingenuity Pathway Analysis (www.ingenuity.com). As expected, we found a lack of enrichments or networks of interest due to the curated database on which this software is based. The missing proteins are proteins without experimental evidence, and in most cases this is linked to scarce bibliographic information about them or their coding genes. However, interesting relationships were found between the protein WBP2NL (a sperm-specific WW domain-binding protein that promotes meiotic resumption and pronuclear development during oocyte fertilization) and "reproductive system development and function"; the proteins CNGA2 (Cyclic Nucleotide Gated Channel Alpha 2) and PLCZ1 (Phospholipase C, Zeta 1, a protein that localizes to the acrosome in spermatozoa and elicits Ca(2+) oscillations and egg activation during fertilization) and "sperm mobility"; and UNCX (UNC Homeobox, a transcription factor involved in somitogenesis and neurogenesis and required for the maintenance and differentiation of particular elements of the axial skeleton) and "embryonic development".

Manual evaluation of PSMs and selection of peptides

As previously mentioned, the estimated protein FDR for the proteins selected was 8 %. In order to minimize the influence of this value on the quality of the results, we performed additional filtering steps to select the peptides for experimental validation. First, we selected the peptides with fewer than 20 amino acids in length, due to the limitation in the synthesis of heavy peptides for SRM/MRM experiments. For each remaining peptide of chromosome 16 or the HEK293 cell line, we chose its best PSM using the maximum Mascot ion score. This resulted in a total of 59 peptides, 43 of which were observed in the HEK293 cell line and 16 of which were observed in chromosome 16.

The last stage consisted of a manual curation of the assigned spectra by three mass spectrometry experts. The 59 spectra were visualized and evaluated using the software SeeMS 3.0.7106.0 from the ProteoWizard platform for proteomics data analysis,³¹ according to the following features:³² (a) the quality of the y-ion and b-ion series assignments; (b) the peak intensities and observed signal-to-noise ratio; (c) the number of nonassigned peaks. Only the PSMs considered as "high quality" by the three experts were retained for further analysis. For illustrative purposes, Fig. 6 shows 4 PSMs corresponding to 2 peptides selected from chromosome 16 (Fig. 6A and Fig. 6B) and 1 peptide from the HEK293 cell line (Fig. 6C and Fig. 6D). The complete list of the 17 peptides selected for validation by SRM/MRM can be found in Table 6.

Figure 6: (A) Spectrum assignment of peptide LYSSLLDEIR from protein NX_Q8TD57 (DNAH3, chromosome 16) detected with Mascot ion score 75.99 in spermatozoa. (B) Spectrum assignment of peptide TQTISLGQGQGPIAAK from protein NX_Q8TD57 (DNAH3, chromosome 16) detected with Mascot ion score 80.77 in spermatozoa. (C) Spectrum assignment of peptide DAASCGPGAAVAAVER from protein NX_A6NJT0 (UNCX, chromosome 7) detected with Mascot ion score 113.22 in HEK293 cell lines.
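The automatic filtering steps described in this section (a length cut-off followed by the choice of the best PSM per peptide) can be sketched as below. The `PSM` record and the function name are hypothetical, and two of the three example entries are invented to exercise the filters.

```python
from dataclasses import dataclass

@dataclass
class PSM:
    peptide: str
    ion_score: float

def select_best_psms(psms, max_len=20):
    """Keep peptides shorter than max_len and their highest-scoring PSM."""
    best = {}
    for psm in psms:
        if len(psm.peptide) >= max_len:
            continue  # too long for heavy-peptide synthesis
        current = best.get(psm.peptide)
        if current is None or psm.ion_score > current.ion_score:
            best[psm.peptide] = psm
    return best

psms = [
    PSM("LYSSLLDEIR", 75.99),
    PSM("LYSSLLDEIR", 52.10),  # hypothetical lower-scoring PSM
    PSM("A" * 25, 90.0),       # hypothetical peptide, too long
]
best = select_best_psms(psms)
print(best)
```

The manual curation by experts that follows these steps is a judgment call on spectrum quality and is not modeled here.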

Table 6: Peptides selected for validation using targeted proteomics (SRM/MRM).

| Protein | Name | Peptide | Chr | Sample | Ion score | Missing in neXtProt 20160111 | HPP guidelines (2 proteotypic peptides) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| NX_A6NJT0 | UNCX | DAASCGPGAAVAAVER | 7 | HEK | 113.22 | ✓ | ✓ |
| NX_Q9BQ87 | TBL1Y | IWTENGNLASTLGQHK | Y | HEK | 93.62 | ✓ | ✓ |
| NX_Q8TD57 | DNAH3 | TQTISLGQGQGPIAAK | 16 | Spermatozoa | 80.77 | | ✓ |
| NX_Q8TD57 | DNAH3 | LYSSLLDEIR | 16 | Spermatozoa | 75.99 | | ✓ |
| NX_Q2VIQ3 | KIF4B | EMCDMEQVLSK | 5 | HEK | 67.29 | ✓ | ✓ |
| NX_Q5T2N8 | ATAD3C | AAGTLFGEGFR | 1 | HEK | 66.45 | ✓ | |
| NX_Q2VIQ3 | KIF4B | NLELEVINLQK | 5 | HEK | 64.73 | ✓ | ✓ |
| NX_A8K0S8 | MEIS3P2 | MVQPMIDQSNR | 17 | HEK | 58.75 | ✓ | |
| NX_Q8TD57 | DNAH3 | EANVAAAIAQGIK | 16 | Spermatozoa | 49.37 | | ✓ |
| NX_A6NHN6 | NPIPB15 | ADEVEQSPKPK | 16 | Spermatozoa | 47.7 | ✓ | |
| NX_Q8N0W5 | IQCK | AGEPFTEFFSIPFVEER | 16 | Spermatozoa | 43.57 | ✓ | |
| NX_B2RXH8 | HNRNPCL2 | MIASQVAVINLAAEPK | 1 | HEK | 43.42 | ✓ | ✓ |
| NX_Q8TD57 | DNAH3 | VESVLFPELK | 16 | Spermatozoa | 39.34 | | ✓ |
| NX_Q8TD57 | DNAH3 | DFDLEEVMK | 16 | Spermatozoa | 37.96 | | ✓ |
| NX_Q8TD57 | DNAH3 | AVVFVDDLNMPAK | 16 | Spermatozoa | 36.67 | | ✓ |
| NX_Q8TD57 | DNAH3 | GNILEDETAIK | 16 | Spermatozoa | 36.09 | | ✓ |
| NX_Q6URK8 | TEPP | YCLSQNPSLDR | 16 | Spermatozoa | 31.36 | | |

Although all the analyses and the selection of the peptides for validation were carried out using the neXtProt database release 20150901, the release of a new version (20160111) compelled us to compare the results at this stage with the new list of missing proteins. As shown in Table 6, all the selected proteins except DNAH3 and TEPP (with new evidence in the spermatozoa sample) were still considered missing proteins in the new release.
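Re-checking a selection against a newer release amounts to a set comparison. The sketch below uses the gene names of the selected proteins from Table 6; the `missing_new_release` set is a hypothetical stand-in for the full missing-protein list of release 20160111, which is not reproduced here.

```python
def split_by_status(selected, missing_new_release):
    """Separate proteins that are still missing from those reclassified."""
    still_missing = selected & missing_new_release
    reclassified = selected - missing_new_release
    return still_missing, reclassified

# Proteins selected for validation (gene names from Table 6).
selected = {"UNCX", "TBL1Y", "DNAH3", "KIF4B", "ATAD3C", "MEIS3P2",
            "NPIPB15", "IQCK", "HNRNPCL2", "TEPP"}

# Hypothetical stand-in for the newer release's missing-protein list.
missing_new_release = (selected - {"DNAH3", "TEPP"}) | {"OTHER_MISSING_PROTEIN"}

still_missing, reclassified = split_by_status(selected, missing_new_release)
print(sorted(reclassified))
```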

Validation of missing protein identifications using SRM/MRM

In order to validate the identifications of the missing proteins, two experimental strategies were designed. First, a sample with a mixture of the heavy peptides for the 23 peptides selected for validation was analyzed using MIDAS (MRM-initiated detection and sequencing), a method in which the mass spectrometer (ABSciex QTrap5500) switches from MRM to enhanced product ion scanning mode when an individual MRM is detected. Data were examined manually to verify the chromatographic peaks and the transitions detected for each peptide, and 18 peptides were used for further analysis (Supporting Information Table 6). The MS/MS spectra of the heavy precursors were searched with Mascot against the neXtProt database using the target-decoy strategy with the following parameters: precursor tolerance 0.8 Da, fragment tolerance 0.6 Da, two missed cleavages, carbamidomethyl cysteine as a fixed modification and oxidized methionine as a variable modification. The identification of proteins was performed with a criterion of PSM FDR < 1 % and protein FDR < 1 %, using PAnalyzer to select only conclusive proteins. We detected 14 of the synthetic peptides, corresponding to 8 missing proteins (Table 7). In Fig. 7 we show the comparison between a selection of the fragmentation spectra obtained for the heavy peptides and the corresponding endogenous spectra found in the shotgun experiments for the peptides VESVLFPELK (DNAH3), EANVAAAIAQGIK (DNAH3) and AAGTLFGEGFR (ATAD3C) using the SDPScore.³³
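The exact definition of the SDPScore is given in the cited reference and is not reproduced here. As an illustration of the general idea of comparing two fragmentation spectra, the sketch below computes a normalized dot product over binned peaks. The bin width, function names and peak lists are assumptions, and the mass shift that a heavy label introduces in some fragment ions is ignored.

```python
from math import sqrt

def binned(spectrum, bin_width=1.0):
    """Sum peak intensities into fixed-width m/z bins."""
    bins = {}
    for mz, intensity in spectrum:
        key = int(mz / bin_width)
        bins[key] = bins.get(key, 0.0) + intensity
    return bins

def normalized_dot_product(spec_a, spec_b, bin_width=1.0):
    """Cosine similarity between two binned spectra (0 = disjoint, 1 = identical shape)."""
    a, b = binned(spec_a, bin_width), binned(spec_b, bin_width)
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Hypothetical (m/z, intensity) peak lists.
endogenous = [(175.1, 100.0), (304.2, 50.0), (474.3, 80.0)]
synthetic = [(175.1, 200.0), (304.2, 100.0), (474.3, 160.0)]  # same pattern, twice as intense
unrelated = [(120.0, 10.0), (650.4, 30.0)]

same_shape = normalized_dot_product(endogenous, synthetic)
disjoint = normalized_dot_product(endogenous, unrelated)
print(same_shape, disjoint)
```

Because the score is normalized, it depends on the relative fragment pattern and not on the absolute signal, which is what makes a synthetic standard comparable with a weaker endogenous spectrum.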

Table 7: Results of the Mascot search of the heavy peptide sample using the neXtProt database. Conclusive proteins according to PAnalyzer were selected (PSM FDR < 1 %, protein FDR < 1 %).

| Peptide | Protein | Chr | Name | Max ion score | Missing in neXtProt 20160111 |
| --- | --- | --- | --- | --- | --- |
| DAASCGPGAAVAAVER | NX_A6NJT0 | 7 | UNCX | 78.49 | ✓ |
| VESVLFPELK | NX_Q8TD57 | 16 | DNAH3 | 75.29 | |
| AAGTLFGEGFR | NX_Q5T2N8 | 1 | ATAD3C | 56.93 | ✓ |
| MIASQVAVINLAAEPK | NX_B2RXH8 | 1 | HNRNPCL2 | 55.83 | ✓ |
| ADEVEQSPKPK | NX_A6NHN6 | 16 | NPIPB15 | 54.39 | ✓ |
| EANVAAAIAQGIK | NX_Q8TD57 | 16 | DNAH3 | 52.72 | |
| EMCDMEQVLSK | NX_Q2VIQ3 | 5 | KIF4B | 48.67 | ✓ |
| YCLSQNPSLDR | NX_Q6URK8 | 16 | TEPP | 46.26 | |
| MVQPMIDQSNR | NX_A8K0S8 | 17 | MEIS3P2 | 45.03 | ✓ |
| LYSSLLDEIR | NX_Q8TD57 | 16 | DNAH3 | 44.68 | |
| TQTISLGQGQGPIAAK | NX_Q8TD57 | 16 | DNAH3 | 44.15 | |
| AVVFVDDLNMPAK | NX_Q8TD57 | 16 | DNAH3 | 43.37 | |
| DFDLEEVMK | NX_Q8TD57 | 16 | DNAH3 | 39.71 | |
| GNILEDETAIK | NX_Q8TD57 | 16 | DNAH3 | 39.6 | |

Figure 7: (A) Comparison of the MS/MS spectrum of peptide EANVAAAIAQGIK from DNAH3 protein obtained in the shotgun experiment (lower) and the MS/MS spectrum for its synthetic heavy peptide (upper) obtained in the LC-SRM experiment (SDPScore = 0.88). (B) Comparison of the MS/MS spectrum of peptide VESVLFPELK from DNAH3 protein obtained in the shotgun experiment (lower) and the MS/MS spectrum for its synthetic heavy peptide (upper) obtained in the LC-SRM experiment (SDPScore = 0.90). (C) Comparison of the MS/MS spectrum of peptide AAGTLFGEGFR from ATAD3C protein obtained in the shotgun experiment (lower) and the MS/MS spectrum for its synthetic heavy peptide (upper) obtained in the LC-SRM experiment (SDPScore = 0.89).

The final step of the validation process was the targeting of the selected peptides by SRM to detect them in the biological samples of interest (spermatozoa and the HEK293 cell line). This approach allowed us to confirm the presence of 4 peptides in spermatozoa (DNAH3) and an additional peptide for the protein ATAD3C in the HEK293 cell line, as can be seen in Fig. 8A. As mentioned before, the protein DNAH3 changed its evidence from missing protein to PE1 during the development of our study (neXtProt release 20160111). The evidence at the protein level was obtained from PeptideAtlas using a reanalysis of the spermatozoa proteome.¹³ We validated this evidence in a set of independent samples with the detection of 4 proteotypic peptides. In Fig. 8C-E we show the SRM/MRM signal for three of these peptides, and in Fig. 8B we show the SRM/MRM signal for the peptide detected from ATAD3C in the HEK293 cell line.

Figure 8: (A) Venn diagram with the peptides selected for detection and the results of the different stages of the validation analysis. (B) Endogenous (upper) and synthetic heavy peptide (lower) LC-SRM signals measured for the peptide AAGTLFGEGFR from ATAD3C in HEK293 cell line. (C) Endogenous (upper) and synthetic heavy peptide (lower) LC-SRM signals measured for the peptide EANVAAAIAQGIK from DNAH3 in spermatozoa sample. (D) Endogenous (upper) and synthetic heavy peptide (lower) LC-SRM signals measured for the peptide GNILEDETAIK from DNAH3 in spermatozoa sample. (E) Endogenous (upper) and synthetic heavy peptide (lower) LC-SRM signals measured for the peptide VESVLFPELK from DNAH3 in spermatozoa sample.

Although we found only one peptide using LC-SRM in the HEK293 cell line, we suggest increasing the number of experiments in this sample in order to validate the presence of a large number of missing proteins using other experimental protocols or other proteomic techniques, for example antibody-based technologies. We consider this finding as an opportunity to characterize proteins with potential interest in molecular and biological research. For example, the proteins ATAD3C (ATPase Family, AAA Domain Containing 3C), validated using the LC-SRM approach, and UNCX (UNC Homeobox), detected in the shotgun data analysis following the HPP guidelines, have been previously related to embryonic development³⁴,³⁵ and tumorigenesis.³⁶ More specifically, mutations of the ATAD3C gene have been associated with colorectal cancer (COSMIC accession codes 2230025 and 2230026).

Collectively, the validation experiments demonstrated the success of the strategy described in this manuscript to detect missing proteins using the analysis of public high-throughput proteomic datasets. The analysis of the shotgun experiments of the samples enriched in missing proteins from chromosome 16 detected high-quality spectra assigned to a set of proteins defined as missing proteins in neXtProt release 20150901, although a small fraction of them are now considered PE1 proteins in the current release (20160111). The analysis of synthetic heavy peptides for 17 selected peptides using a MIDAS approach and the LC-SRM analysis performed in the spermatozoa sample and the HEK293 cell line provided support to the proteomics evidence for missing protein candidates. Interestingly, we were able to confirm the presence of the protein DNAH3 in spermatozoa and ATAD3C in the HEK293 cell line using SRM/MRM. The protein DNAH3 was a missing protein in the neXtProt release used in this study, but is considered PE1 in the current release. This observation confirms the value of the developed strategy for the annotation of missing proteins. We also provided robust evidence supporting HEK293 cells as a promising source of missing proteins.

Conclusions

The complete characterization of the human proteome is an ambitious task which is being carried out jointly by proteomics laboratories worldwide in the framework of the HPP project.³⁷ Despite the efforts made and the resources devoted to this issue since its start in 2001, no experimental evidence for 14.70 % of human proteins (neXtProt release 20160201) has yet been detected in any biological matrix. The detection of this set of proteins, known as the "missing proteins", is a huge challenge from the proteomics, bioinformatics and statistical points of view.⁸ In recent years, the analysis of the expression level of the protein coding genes and their tissue specificity has revealed a map with the most probable location of each missing protein in a wide variety of samples.¹⁷ However, the biochemical characteristics of these proteins make their detection extremely challenging, especially if stringent statistical thresholds are applied to establish the likelihood of the observations.¹³,³² The contents of a variety of databases of proteomic experiments, for example the PRIDE database, have been gradually incorporated into the project to define the reference human proteome. Using the information about all the human PSMs stored in this database, we selected a set of target samples (we found the human spermatozoa and HEK293 cell line samples especially enriched in missing proteins) and we compared two different methods of analysis of shotgun datasets for the identification of missing proteins at the proteome level.

In an attempt to provide new horizons and guidance on how and where missing proteins should be hunted for, we propose here a non-conventional bioinformatic pipeline that relies on the use of PRIDE datasets, relaxing the statistical constraints to allow the selection of PSMs that suggest the presence of peptides from missing proteins, followed by a robust validation process. We used the in silico digestion of the protein reference database and the selection of unique peptides (proteotypic peptides) for the whole proteome to filter the spectra assignments with PSM FDR < 1 %. With this method, without the need for protein inference and protein FDR filtering, we found 182 missing protein candidates. However, in this case the results had to be carefully analyzed by mass spectrometry experts to remove low-quality assignments before the remaining PSMs entered the experimental validation process based on SRM. From our findings, 17 peptides were selected for validation, and heavy peptides were synthesized to validate the identification of 14 missing proteins with SRM/MRM experiments. We identified 4 proteotypic peptides from the protein DNAH3 in the spermatozoa sample and one proteotypic peptide from the protein ATAD3C in the HEK293 cell line using LC-SRM assays. Therefore, we have demonstrated the feasibility of the study of missing proteins using an alternative method that combines the proper selection of the target sample based on MS experiments from public databases and a statistical analysis based on the detection of certain peptides that uniquely define the missing proteins.

Acknowledgement

CIMA, UV and CNIO laboratories are members of the PRBB-ISCIII platform. This study was supported by: PRBB and the Carlos III National Health Institute Agreement, PRBB-ISCIII; grants SAF2014-5478-R from Ministerio de Ciencia e Innovación and ISCIII-RETIC RD06/0020 to FJC; grants 33/2015 from Dpto. de Salud of Gobierno de Navarra and DPI2015-68982-R from Ministerio de Ciencia e Innovación to VS; and grant BFU2012-39482 from Ministerio de Economía y Competitividad to MSP. J.A.V. and N.d.T. are supported by the Wellcome Trust [grant number WT101477MA].

Supporting Information Available

Supporting Information Table 1: List of missing proteins with the number of proteotypic peptides using the in silico digestion of neXtProt (release 20150901) with Proteogest software.

Supporting Information Table 2: Mascot search engine results obtained for the different projects from the PRIDE database analyzed (PSM FDR < 1 %).

Supporting Information Table 3: Summary of the HPP guideline results.

Supporting Information Table 4: List of the peptides from the missing proteins obtained in the analysis of the PRIDE samples after PSM FDR < 1 % filtering, intersected with the proteotypic peptides of neXtProt.

Supporting Information Table 5: Functional characterization of the missing proteins detected using the analysis of the proteotypic peptides of neXtProt database (183 proteins) with DAVID software.

Supporting Information Table 6: List of SRM/MRM transitions designed for the detection of the peptides observed from the missing proteins in the shotgun experiments with the ABSciex Qtrap5500.

Supporting Information File 1: Supplemental methods.

This material is available free of charge via the Internet at http://pubs.acs.org/.

References

(1) Legrain, P. et al. The human proteome project: Current state and future direction. Mol Cell Proteomics 2011,

(2) Paik, Y.-K. et al. Standard guidelines for the chromosome-centric human proteome project. J Proteome Res 2012, 11, 2005-2013.

(3) Paik, Y.-K. et al. The Chromosome-Centric Human Proteome Project for cataloging proteins encoded in the genome. Nat Biotechnol 2012, 30, 221-223.

(4) Aebersold, R.; Bader, G. D.; Edwards, A. M.; van Eyk, J. E.; Kussmann, M.; Qin, J.; Omenn, G. S. The Biology/Disease-driven Human Proteome Project (B/D-HPP): Enabling Protein Research for the Life Sciences Community. J Proteome Res 2013, 1, 23-27.

(5) Aebersold, R.; Bader, G. D.; Edwards, A. M.; van Eyk, J. E.; Kussmann, M.; Qin, J.; Omenn, G. S. Highlights of B/D-HPP and HPP Resource Pillar Workshops at 12th Annual HUPO World Congress of Proteomics. Proteomics 2014, 14, 1615-9861.

(6) Nilsson, T.; Mann, M.; Aebersold, R.; Yates, J. R., 3rd; Bairoch, A.; Bergeron, J. J. M. Mass spectrometry in high-throughput proteomics: ready for the big time. Nat Methods 2010, 7, 681-685.

(7) Segura, V. et al. Surfing transcriptomic landscapes. A step beyond the annotation of chromosome 16 proteome. J Proteome Res 2014, 13, 158-172.

(8) Horvatovich, P. et al. Quest for Missing Proteins: Update 2015 on Chromosome-Centric Human Proteome Project. J Proteome Res 2015, 14, 3415-3431.

(9) Gaudet, P.; Michel, P.-A.; Zahn-Zabal, M.; Cusin, I.; Duek, P. D.; Evalet, O.; Gateau, A.; Gleizes, A.; Pereira, M.; Teixeira, D.; Zhang, Y.; Lane, L.; Bairoch, A. The neXtProt knowledgebase on human proteins: current status. Nucleic Acids Res 2015, 43, D764-D770.

(10) Lane, L.; Bairoch, A.; Beavis, R. C.; Deutsch, E. W.; Gaudet, P.; Lundberg, E.; Omenn, G. S. Metrics for the Human Proteome Project 2013-2014 and strategies for finding missing proteins. J Proteome Res 2014, 13, 15-20.

(11) Uhlen, M. et al. Proteomics. Tissue-based map of the human proteome. Science 2015, 347, 1260419.

(12) Djureinovic, D.; Fagerberg, L.; Hallström, B.; Danielsson, A.; Lindskog, C.; Uhlen, M.; Ponten, F. The human testis-specific proteome defined by transcriptomics and antibody-based profiling. Mol Hum Reprod 2014, 20, 476-488.

10 11 12

20 21 22

38 (18) Perez-Riverol, Y.; Alpi, E.; Wang, R.; Hermjakob, H.; Vizcaino, J. A. Making pro-

40 teomics data accessible and reusable: current state of proteomics databases and repos-

42 itories. Proteomics 2015, 15, 930-949.

45 (19) Craig, R.; Cortens, J. P.; Beavis, R. C. Open source system for analyzing, validating,

47 and storing protein identification data. J Proteome Res 2004, 3, 1234-1242.

50 (20) Farrah, T.; Deutsch, E. W.; Omenn, G. S.; Sun, Z.; Watts, J. D.; Yamamoto, T.;

52 Shteynberg, D.; Harris, M. M.; Moritz, R. L. State of the human proteome in 2013 as

54 viewed through PeptideAtlas: comparing the kidney, urine, and plasma proteomes for

56 the biology- and disease-driven Human Proteome Project. J Proteome Res 2014, 13,

58 60-75.

(13) Jumeau, F.; Com, E.; Lane, L.; Duek, P.; Lagarrigue, M.; Lavigne, R.; Guillot, L.; Rondel, K.; Gateau, A.; Melaine, N.; Guevel, B.; Sergeant, N.; Mitchell, V.; Pineau, C. Human Spermatozoa as a Model for Detecting Missing Proteins in the Context of the Chromosome-Centric Human Proteome Project. J Proteome Res 2015, 14, 3606-3620.

(14) Clough, E.; Barrett, T. The Gene Expression Omnibus Database. Methods Mol Biol 2016, 1418, 93-110.

(15) ENCODE Project Consortium,; Bernstein, B. E.; Birney, E.; Dunham, I.; Green, E. D.; Gunter, C.; Snyder, M. An integrated encyclopedia of DNA elements in the human genome. Nature 2011, 489, 57-74.

(16) Tabas-Madrid, D.; Alves-Cruzeiro, J.; Segura, V.; Guruceaga, E.; Vialas, V.; Prieto, G.; García, C.; Corrales, F. J.; Albar, J. P.; Pascual-Montano, A. Proteogenomics Dashboard for the Human Proteome Project. J Proteome Res 2015, 14, 3738-3749.

(17) Guruceaga, E.; Sanchez del Pino, M. M.; Corrales, F. J.; Segura, V. Prediction of a missing protein expression map in the context of the human proteome project. J Proteome Res 2015, 14, 1350-1360.

10 11 12

17 2016, 44, D447-D456.

20 21 22

24 2003, 1, 5.

31 large proteomics data sets generated by tandem mass spectrometry. Mol Cell Pro-

40 matics 2012, 13, 288.

47 After Spinal Cord Injury Using Quantitative Proteomics. Mol Cell Proteomics 2016,

49 15, 1424-1434.

52 (27) Zhang, P.; Dufresne, C.; Turner, R.; Ferri, S.; Venkatraman, V.; Karani, R.;

54 Lutty, G. A.; Van Eyk, J. E.; Semba, R. D. The proteome of human retina. Proteomics

56 2015, 15, 836-840.

(21) Ternent, T.; Csordas, A.; Qi, D.; Gomez-Baena, G.; Beynon, R. J.; Jones, A. R.; Hermjakob, H.; Vizcaino, J. A. How to submit MS proteomics data to ProteomeXchange via the PRIDE database. Proteomics 2014, 14, 2233-2241.

(22) Vizcaino, J. A.; Csordas, A.; Del-Toro, N.; Dianes, J. A.; Griss, J.; Lavidas, I.; Mayer, G.; Perez-Riverol, Y.; Reisinger, F.; Ternent, T.; Xu, Q.-W.; Wang, R.; Hermjakob, H. 2016 update of the PRIDE database and its related tools. Nucleic Acids Res

(23) Cagney, G.; Amiri, S.; Premawaradena, T.; Lindo, M.; Emili, A. In silico proteome analysis to facilitate proteomics experiments using mass spectrometry. Proteome Sci

(24) Reiter, L.; Claassen, M.; Schrimpf, S. P.; Jovanovic, M.; Schmidt, A.; Buhmann, J. M.; Hengartner, M. O.; Aebersold, R. Protein identification false discovery rates for very

teomics 2009, 8, 2405-2417.

(25) Prieto, G.; Aloria, K.; Osinalde, N.; Fullaondo, A.; Arizmendi, J. M.; Matthiesen, R. PAnalyzer: a software tool for protein inference in shotgun proteomics. BMC Bioinfor-

(26) da Silva, B. F.; Meng, C.; Helm, D.; Pachl, F.; Schiller, J.; Ibrahim, E.; Lynne, C. M.; Brackett, N. L.; Bertolla, R. P.; Kuster, B. Towards Understanding Male Infertility

10 11 12

20 21 22

46 (35) Li, S.; Rousseau, D. ATAD3, a vital membrane bound mitochondrial ATPase involved

48 in tumor progression. J Bioenerg Biomembr 2012, 44, 189-197.

51 (36) Mouradov, D. et al. Colorectal cancer cell lines are representative models of the main

53 molecular subtypes of primary cancer. Cancer Res 2014, 74, 3238-3247.

56 (37) Paik, Y.-K.; Hancock, W. S. Uniting ENCODE with genome-wide proteomics. Nat

58 Biotechnol 2012, 30, 1065-1067.

(28) Lee, H.-J. et al. Comprehensive genome-wide proteomic analysis of human placental tissue for the Chromosome-Centric Human Proteome Project. J Proteome Res 2013, 12, 2458-2466.

(29) Chick, J. M.; Kolippakkam, D.; Nusinow, D. P.; Zhai, B.; Rad, R.; Huttlin, E. L.; Gygi, S. P. A mass-tolerant database search identifies a large proportion of unassigned spectra in shotgun proteomics as modified peptides. Nat Biotechnol 2015, 33, 743-749.

(30) Huang, D. W.; Sherman, B. T.; Lempicki, R. A. Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources. Nat Protoc 2009, 4, 44-57.

(31) Chambers, M. C. et al. A cross-platform toolkit for mass spectrometry and proteomics. Nat Biotechnol 2012, 30, 918-920.

(32) Carapito, C. et al. Computational and Mass-Spectrometry-Based Workflow for the Discovery and Validation of Missing Human Proteins: Application to Chromosomes 2 and 14. J Proteome Res 2015, 14, 3621-3634.

(33) Ye, D.; Fu, Y.; Sun, R.-X.; Wang, H.-P.; Yuan, Z.-F.; Chi, H.; He, S.-M. Open MS/MS spectral library search to identify unanticipated post-translational modifications and increase spectral identification rate. Bioinformatics 2010, 26, i399-i406.

(34) Sánchez, R. S.; Sanchez, S. S. Characterization of pax1, pax9, and uncx sclerotomal genes during Xenopus laevis embryogenesis. Dev Dyn 2013, 242, 572-579.

10 11 12

20 21 22

Graphical TOC Entry

[Graphical abstract; labels: Spermatozoa, neXtProt, Missing proteins]