A journal of IEEE and CAA that publishes high-quality papers in English on original theoretical and experimental research and development in all areas of automation
Volume 8 Issue 5
May 2021

IEEE/CAA Journal of Automatica Sinica

Citation: J. P. Li, Y. L. Tao, and T. Cai, "Predicting Lung Cancers Using Epidemiological Data: A Generative-Discriminative Framework," IEEE/CAA J. Autom. Sinica, vol. 8, no. 5, pp. 1067-1078, May. 2021. doi: 10.1109/JAS.2021.1003910

Predicting Lung Cancers Using Epidemiological Data: A Generative-Discriminative Framework

doi: 10.1109/JAS.2021.1003910
Funds:  This work was supported in part by Zhejiang Provincial Natural Science Foundation of China (LQ20F030013), Research Foundation of HwaMei Hospital, University of Chinese Academy of Sciences (2020HMZD22), Ningbo Public Service Technology Foundation (202002N3181), and Medical Scientific Research Foundation of Zhejiang Province (2021431314)
  • Predictive models for assessing the risk of developing lung cancer can help identify high-risk individuals so that further screening and early intervention can be recommended. To facilitate pre-hospital self-assessment, some studies have used predictive models trained on non-clinical data (e.g., smoking status and family history). The performance of these models is limited because they do not consider clinical data (e.g., blood test and medical imaging results). Deep learning has shown potential for processing complex data that combine clinical and non-clinical information. However, predicting lung cancer remains difficult because positive samples are severely scarce among follow-ups. To tackle this problem, this paper presents a generative-discriminative framework for improving the ability of deep learning models to generalize. In the proposed framework, two nonlinear generative models, one based on the generative adversarial network and the other on the variational autoencoder, are used to synthesize auxiliary positive samples for the training set. Then, several discriminative models, including a deep neural network (DNN), are used to assess lung cancer risk based on a comprehensive list of risk factors. The framework was evaluated on over 55 000 subjects questioned between January 2014 and December 2017, of whom 699 were clinically diagnosed with lung cancer between January 2014 and August 2019. According to the results, the best-performing predictive model built using the proposed framework was based on the DNN. It achieved an average sensitivity of 76.54% and an area under the curve of 69.24% in distinguishing lung cancer cases from normal cases on test sets.
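The augmentation step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: in place of the GAN and VAE generators, it uses a deliberately simple stand-in (an independent Gaussian fitted to each feature of the positive samples), and the helper names, the toy data, and the `target_ratio` parameter are all assumptions made for illustration. The point it shows is the workflow: fit a generator on the scarce positive class, synthesize auxiliary positives, and add them to the training set only.

```python
import math
import random
import statistics


def fit_gaussian_generator(positives):
    """Fit one Gaussian per feature on the positive samples.

    Stand-in for a GAN or VAE generator; it ignores feature correlations.
    """
    columns = list(zip(*positives))
    means = [statistics.fmean(col) for col in columns]
    stdevs = [statistics.pstdev(col) for col in columns]
    return means, stdevs


def synthesize(generator, n_samples, rng):
    """Draw n_samples synthetic positive samples from the fitted generator."""
    means, stdevs = generator
    return [[rng.gauss(m, s) for m, s in zip(means, stdevs)]
            for _ in range(n_samples)]


def augment_training_set(features, labels, target_ratio, rng):
    """Append synthetic positives until positives / negatives >= target_ratio."""
    positives = [x for x, y in zip(features, labels) if y == 1]
    n_negatives = sum(1 for y in labels if y == 0)
    n_needed = max(0, math.ceil(target_ratio * n_negatives) - len(positives))
    generator = fit_gaussian_generator(positives)
    synthetic = synthesize(generator, n_needed, rng)
    return features + synthetic, labels + [1] * n_needed


# Toy, made-up training set: 20 negatives and 2 positives, 2 features each.
rng = random.Random(0)
train_x = [[rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)] for _ in range(20)]
train_x += [[2.0, 1.5], [2.4, 1.1]]
train_y = [0] * 20 + [1, 1]

aug_x, aug_y = augment_training_set(train_x, train_y, target_ratio=0.5, rng=rng)
print(len(train_y), sum(train_y), "->", len(aug_y), sum(aug_y))
```

A discriminative model would then be trained on `aug_x` and `aug_y`, while the test set keeps only real samples so that reported metrics are not inflated by synthetic data.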





    Figures(4)  / Tables(6)



    • A deep learning model is constructed to assess the risk of developing lung cancer for Chinese people aged between 40 and 74.
    • Basic information, dietary habits, living conditions, psychology and emotion, medical history, and family cancer history are used to facilitate pre-hospital self-assessment.
    • A generative adversarial network is used to synthesize auxiliary positive samples, and a discriminative model is used to predict lung cancer.
    • An average sensitivity of 76.54% is achieved in distinguishing lung cancer cases from normal cases on test sets.
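The two reported metrics, sensitivity and area under the ROC curve (AUC), can be computed as in the following minimal sketch. The labels, scores, and the 0.5 decision threshold are made-up illustrative values, not data from the study, and the function names are hypothetical. Sensitivity is the fraction of true cases that the model flags; AUC is computed here through its rank interpretation, the probability that a randomly chosen case scores higher than a randomly chosen non-case, with ties counted as one half.

```python
def sensitivity(y_true, y_pred):
    """True positive rate: TP / (TP + FN)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp / (tp + fn)


def roc_auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up example: 3 cases (label 1) and 5 non-cases (label 0).
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.6, 0.3, 0.7, 0.4, 0.2, 0.1, 0.05]
threshold = 0.5  # assumed cut-off for turning scores into predictions
y_pred = [1 if s >= threshold else 0 for s in scores]

sens = sensitivity(y_true, y_pred)
auc = roc_auc(y_true, scores)
print(f"sensitivity={sens:.4f}, AUC={auc:.4f}")
```

Sensitivity depends on the chosen threshold, whereas AUC summarizes ranking quality across all thresholds, which is why the two figures can differ noticeably on an imbalanced cohort.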

