A journal of IEEE and CAA that publishes high-quality papers in English on original theoretical and experimental research and development in all areas of automation.
Volume 9 Issue 1
Jan.  2022

IEEE/CAA Journal of Automatica Sinica

Citation: L. Hu, S. C. Yang, X. Luo, H. Q. Yuan, K. Sedraoui, and M. C. Zhou, “A distributed framework for large-scale protein-protein interaction data analysis and prediction using MapReduce,” IEEE/CAA J. Autom. Sinica, vol. 9, no. 1, pp. 160–172, Jan. 2022. doi: 10.1109/JAS.2021.1004198

A Distributed Framework for Large-scale Protein-protein Interaction Data Analysis and Prediction Using MapReduce

doi: 10.1109/JAS.2021.1004198
Funds:  This work was supported in part by the National Natural Science Foundation of China (61772493), the CAAI-Huawei MindSpore Open Fund (CAAIXSJLJJ-2020-004B), the Natural Science Foundation of Chongqing (China) (cstc2019jcyjjqX0013), Chongqing Research Program of Technology Innovation and Application (cstc2019jscx-fxydX0024, cstc2019jscx-fxydX0027, cstc2018jszx-cyzdX0041), Guangdong Province Universities and College Pearl River Scholar Funded Scheme (2019), the Pioneer Hundred Talents Program of Chinese Academy of Sciences, and the Deanship of Scientific Research (DSR) at King Abdulaziz University (G-21-135-38)
  • Protein-protein interactions are of great significance for understanding the functional mechanisms of proteins. With the rapid development of high-throughput genomic technologies, massive protein-protein interaction (PPI) data have been generated, making them very difficult to analyze efficiently. To address this problem, this paper presents a distributed framework that reimplements one of the state-of-the-art algorithms, CoFex, using MapReduce. To do so, an in-depth analysis of its limitations is conducted from the perspectives of efficiency and memory consumption when it is applied to large-scale PPI data analysis and prediction. Respective solutions are then devised to overcome these limitations. In particular, we adopt a novel tree-based data structure to reduce the heavy memory consumption caused by the huge volume of protein sequence information. The procedure of CoFex is then modified to follow the MapReduce framework so that the prediction task is performed in a distributed manner. A series of extensive experiments have been conducted to evaluate the performance of our framework in terms of both efficiency and accuracy. Experimental results demonstrate that the proposed framework improves the computational efficiency of CoFex by more than two orders of magnitude while retaining the same high accuracy.
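
To make the MapReduce pattern mentioned in the abstract concrete, the following is a minimal single-process sketch of the map, shuffle, and reduce steps. It is not the CoFex procedure and not the paper's implementation: the feature used here (co-occurring pairs of k-mers across two interacting sequences), the function names `kmers`, `map_pair`, `shuffle`, `reduce_counts`, and `run`, and the toy sequences are all assumptions chosen for illustration. The tree-based data structure is not modeled.

```python
from collections import defaultdict
from itertools import product


def kmers(seq, k):
    """Return the set of distinct length-k substrings of a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def map_pair(pair, k=2):
    """Map step (hypothetical feature): emit ((kmer_a, kmer_b), 1)
    for every k-mer combination in one interacting protein pair."""
    seq_a, seq_b = pair
    for a, b in product(sorted(kmers(seq_a, k)), sorted(kmers(seq_b, k))):
        yield (a, b), 1


def shuffle(mapped):
    """Shuffle step: group the emitted values by key."""
    groups = defaultdict(list)
    for key, value in mapped:
        groups[key].append(value)
    return groups


def reduce_counts(groups):
    """Reduce step: sum the values collected for each key."""
    return {key: sum(values) for key, values in groups.items()}


def run(pairs, k=2):
    """Chain map, shuffle, and reduce over all protein pairs."""
    mapped = (kv for pair in pairs for kv in map_pair(pair, k))
    return reduce_counts(shuffle(mapped))


# Toy input: two made-up interacting pairs of short sequences.
toy_pairs = [("MKV", "AKL"), ("MKA", "AKV")]
counts = run(toy_pairs)
```

In a real cluster each call to `map_pair` would run on whichever node holds that pair, and the shuffle would move data between nodes; here everything runs in one process so that the data flow is easy to follow.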





    Figures(4)  / Tables(4)



    • In this paper, a distributed framework is presented that reimplements one of the state-of-the-art algorithms with MapReduce so that it can be applied to large-scale protein-protein interaction prediction.
    • Experimental results demonstrate that the proposed framework improves the computational efficiency of that algorithm by more than two orders of magnitude while retaining the same high accuracy.
    • An upper limit on the efficiency of the proposed framework exists: the running time cannot be reduced indefinitely by simply adding computing nodes. Once the number of computing nodes exceeds some threshold, data transfer takes more time than computation and thus constrains further improvement in efficiency.
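
The node-count threshold described in the last highlight can be illustrated with a toy cost model. The sketch below assumes that computation time shrinks in proportion to 1/n while data-transfer time grows linearly with the number of nodes n. That linear form, the function names `running_time` and `best_node_count`, and the numbers used are assumptions made for illustration; the source states only the qualitative behavior.

```python
def running_time(nodes, compute_work, transfer_cost_per_node):
    """Toy model (assumed form): compute time scales as 1/nodes,
    transfer time grows linearly with the number of nodes."""
    return compute_work / nodes + transfer_cost_per_node * nodes


def best_node_count(compute_work, transfer_cost_per_node, max_nodes):
    """Scan 1..max_nodes and return the count with the smallest modeled time."""
    return min(
        range(1, max_nodes + 1),
        key=lambda n: running_time(n, compute_work, transfer_cost_per_node),
    )


# Hypothetical workload: 3600 time units of computation, 1 unit of transfer per node.
best = best_node_count(compute_work=3600.0, transfer_cost_per_node=1.0, max_nodes=200)
time_at_best = running_time(best, 3600.0, 1.0)
time_at_double = running_time(2 * best, 3600.0, 1.0)
```

Under these assumed numbers the modeled time is smallest at 60 nodes, and doubling the node count to 120 makes the modeled run slower because the transfer term dominates.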

