Social media is one application of crowdsourcing for gathering vast amounts of information. Existing incident detection applications that use social media data commonly focus on text analysis. Because social media captures multiple data types, such as text, audio, images, and video, developing incident detection based on multimodal data is preferable, and using multimodal data is expected to improve prediction accuracy. This research aims to detect emergency incidents from multimodal social media data streams using deep learning methods. We compare several deep learning architectures built on two neural network variants, namely the Convolutional Neural Network (CNN) and Long Short-Term Memory (LSTM). We crawled data via the Twitter API and labeled it into three incident categories, i.e., flood, traffic jam, and wildfire. For text prediction, we used CNN and C-LSTM models; the best performance was obtained by C-LSTM, which achieved 99.09% accuracy. For image prediction, we compared four CNN models: AlexNet, VGG16, VGG19, and SqueezeNet. The best performance was obtained by VGG16 with data augmentation, which achieved 99.08% accuracy. The final multimodal incident detection result is taken from whichever modality, text or image, yields the higher confidence level.
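The decision rule described above, keeping the prediction from whichever modality reports the higher confidence, is a form of late fusion. A minimal sketch is shown below; the function name, label set, and probability vectors are illustrative, not taken from the paper's implementation.

```python
import numpy as np

LABELS = ("flood", "traffic jam", "wildfire")  # assumed label order

def fuse_by_confidence(text_probs, image_probs, labels=LABELS):
    """Late fusion: return the label and confidence of the more
    confident modality (text vs. image classifier output)."""
    text_probs = np.asarray(text_probs, dtype=float)
    image_probs = np.asarray(image_probs, dtype=float)
    # Confidence here is the maximum class probability of each classifier.
    if text_probs.max() >= image_probs.max():
        return labels[int(text_probs.argmax())], float(text_probs.max())
    return labels[int(image_probs.argmax())], float(image_probs.max())

# Example: the image classifier is more confident, so its prediction wins.
label, conf = fuse_by_confidence([0.6, 0.3, 0.1], [0.05, 0.9, 0.05])
```

Ties are broken in favor of the text modality here; the abstract does not specify a tie-breaking rule, so this choice is an assumption.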