Fast Violence Recognition in Video Surveillance by Integrating Object Detection and Conv-LSTM

This article aims to develop an intelligent system that analyzes long CCTV video sequences economically and efficiently, helping to avert catastrophes and reduce the violent threats citizens face every day.

Authors

Nikita Jain, Department of Computer Science & Engineering, Bharati Vidyapeeth’s College of Engineering, Delhi, India

Vedika Gupta, Assistant Professor, Jindal Global Business School, O.P. Jindal Global University, Sonipat, Haryana, India

Usman Tariq, Management Information System Department, College of Business Administration, Prince Sattam Bin Abdulaziz University, Al-Kharj, 16278, Saudi Arabia

D. Jude Hemanth, Department of Electronics & Communication Engineering, Karunya University, Coimbatore, India

Summary

Video surveillance generates petabytes of data, which requires expensive storage hardware and can also be time-inefficient to process. This article therefore aims to develop an intelligent system that analyzes long CCTV video sequences economically and efficiently, helping to avert catastrophes and reduce the violent threats citizens face every day.

Existing models already achieve high accuracy on available datasets, so the primary focus here is to improve the speed of violence detection (time efficiency) and to use very little storage (economy) so that the system can run in real time. The paper presents an end-to-end hybrid solution for detecting violence in real-time video frames that applies human and weapon detection algorithms in a synchronized way.

The article proposes a generic HWVd (Human Weapon Violence detection) model intended to detect all kinds of public violence. HWVd is a three-tier ensemble model for detecting violence in videos. The first tier performs human detection using the LightTrack framework. The second tier uses a Fast Region-based Convolutional Neural Network (F-RCNN) to detect any weapon in the video.

The third tier uses VGG-19, a pre-trained CNN, for spatial feature extraction and a Long Short-Term Memory (LSTM) network to detect violent activity. Finally, the output of this framework is passed to a Support Vector Machine, which classifies the activity as (i) violence not involving a weapon, (ii) violence involving a weapon, or (iii) non-violent. The proposed model achieves an accuracy of 98%.
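The flow through the three tiers can be illustrated with a minimal sketch. It shows only how the tier outputs might be combined into the three activity classes; it is not the authors' implementation. The names `TierOutputs`, `run_tiers`, `classify` and the `threshold` parameter are hypothetical, the three detectors are passed in as plain callables standing in for LightTrack, F-RCNN and the VGG-19 + LSTM network, and a simple rule stands in for the Support Vector Machine used in the paper.

```python
from dataclasses import dataclass
from typing import Any, Callable, Sequence

Frame = Any  # placeholder type for one video frame

# The three activity classes named in the summary.
NON_VIOLENT = "non-violent"
VIOLENCE_NO_WEAPON = "violence not involving weapon"
VIOLENCE_WITH_WEAPON = "violence involving weapon"


@dataclass
class TierOutputs:
    """Hypothetical container for what each tier reports about a clip."""
    human_present: bool     # tier 1: human detection
    weapon_present: bool    # tier 2: weapon detection
    violence_score: float   # tier 3: violence score in [0, 1]


def run_tiers(
    frames: Sequence[Frame],
    detect_human: Callable[[Frame], bool],
    detect_weapon: Callable[[Frame], bool],
    score_violence: Callable[[Sequence[Frame]], float],
) -> TierOutputs:
    """Run the three tiers over one clip using caller-supplied detectors."""
    human = any(detect_human(f) for f in frames)
    weapon = any(detect_weapon(f) for f in frames)
    score = score_violence(frames)  # sequence-level score for the whole clip
    return TierOutputs(human, weapon, score)


def classify(outputs: TierOutputs, threshold: float = 0.5) -> str:
    """Rule-based stand-in for the final classifier (assumption, not an SVM)."""
    # Assumption: a clip with no detected human, or a low violence score,
    # is treated as non-violent.
    if not outputs.human_present or outputs.violence_score < threshold:
        return NON_VIOLENT
    if outputs.weapon_present:
        return VIOLENCE_WITH_WEAPON
    return VIOLENCE_NO_WEAPON
```

As a usage example, calling `run_tiers` with stub detectors (for instance `lambda f: True` for humans, `lambda f: False` for weapons, and `lambda fs: 0.9` for the violence score) and passing the result to `classify` returns the label for violence not involving a weapon. In a real system each stub would be replaced by the corresponding trained model.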

Published in: International Journal on Artificial Intelligence Tools

