SMS Spam Collection
  • AI Chat
  • Code
  • Report
  • Beta
    Spinner

    SMS Spam Collection

    This is a text corpus of over 5,500 English SMS messages with ~13% labeled as spam. The text file contains one message per line with two columns: the label ("ham" or "spam") and the raw text of the message. Messages labeled as "ham" are non-spam messages that can be considered legitimate.

    Not sure where to begin? Scroll to the bottom to find challenges!

    import pandas as pd
    
    pd.read_csv("SMSSpamCollection.csv", header=None)

    Source of dataset. This corpus was created by Tiago A. Almeida and José María Gómez Hidalgo.

    Citations:

    • Almeida, T.A., Gómez Hidalgo, J.M., Yamakami, A. Contributions to the Study of SMS Spam Filtering: New Collection and Results. Proceedings of the 2011 ACM Symposium on Document Engineering (DOCENG'11), Mountain View, CA, USA, 2011.

    • Gómez Hidalgo, J.M., Almeida, T.A., Yamakami, A. On the Validity of a New SMS Spam Collection. Proceedings of the 11th IEEE International Conference on Machine Learning and Applications (ICMLA'12), Boca Raton, FL, USA, 2012.

    • Almeida, T.A., Gómez Hidalgo, J.M., Silva, T.P. Towards SMS Spam Filtering: Results under a New Dataset. International Journal of Information Security Science (IJISS), 2(1), 1-18, 2013.

    Don't know where to start?

    Challenges are brief tasks designed to help you practice specific skills:

    • 🗺️ Explore: What are the most common words in spam versus normal messages?
    • 📊 Visualize: Create a word cloud visualizing the most common words in the dataset.
    • 🔎 Analyze: What word is most likely to indicate that a message is spam?

    Scenarios are broader questions to help you develop an end-to-end project for your portfolio:

    You work for a telecom company that is launching a new messaging app. Unfortunately, previous spam filters that they have used are out of date and no longer effective. They have asked you whether you can use new data they have supplied to distinguish between spam and regular messages accurately. They have also told you that it is essential that regular messages are rarely, if ever, categorized as spam.

    You will need to prepare a report that is accessible to a broad audience. It should outline your motivation, steps, findings, and conclusions.


    ✍️ If you have an idea for an interesting Scenario or Challenge, or have feedback on our existing ones, let us know! You can submit feedback by pressing the question mark in the top right corner of the screen and selecting "Give Feedback". Include the phrase "Content Feedback" to help us flag it in our system.