Gender Detection in Blogs

Text has long been a prominent part of internet media. Blog writing has become a widespread hobby among people of all ages. Popular platforms for social text include Twitter, Facebook, e-mail, chat rooms, and blogs. With so much content comes a lot of responsibility, and since the internet itself cannot take responsibility for all of that content, it falls on the authors. Author profiling and attribution therefore become important tasks, and we try to capture one aspect of them: gender. This project aims to profile blogs by the gender of their author.

Dataset Credits

The dataset used is Koppel's Blog Dataset, containing 19320 blog instances.
Source - http://u.cs.biu.ac.il/~koppel/BlogCorpus.htm

Main System

The main system consists of:

  1. Feature Extraction from the blogs.
  2. Training different classifiers on the feature set.
  3. Building an ensemble model from the different classifiers.

A more detailed view of the system is shown in the flowchart below:

Flowchart

Feature Extraction

The following features were used for classification:

Classification

The following classifiers are used for classification:

Predicting the output

To predict the output, we take an ensemble (i.e. a majority vote) of the classifiers above.
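As an illustration of majority voting, here is a minimal sketch in Python. It is not the project's actual code: the function names `majority_vote` and `ensemble_predict`, the label strings, and the tie-breaking rule (first label seen wins) are all assumptions made for the example.

```python
from collections import Counter


def majority_vote(votes):
    """Return the label that most classifiers predicted for one blog.

    `votes` holds one predicted label per classifier. On a tie, the
    label that appears first in `votes` wins (an assumption of this
    sketch, not necessarily what the tool does).
    """
    if not votes:
        raise ValueError("need at least one vote")
    return Counter(votes).most_common(1)[0][0]


def ensemble_predict(per_classifier_labels):
    """Combine the predictions of several classifiers.

    `per_classifier_labels` is a list with one entry per classifier;
    each entry is that classifier's list of labels, one per blog.
    """
    # zip(*...) regroups the labels so each tuple holds all votes for one blog.
    return [majority_vote(list(votes)) for votes in zip(*per_classifier_labels)]


# Hypothetical predictions from three classifiers on two blogs.
predictions = [
    ["male", "female"],    # classifier 1
    ["male", "male"],      # classifier 2
    ["female", "female"],  # classifier 3
]
print(ensemble_predict(predictions))
```

Using an odd number of classifiers avoids ties in a two-class problem such as this one.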

Results

For now, the tool achieves an accuracy of around 73% on Koppel's Blog Dataset.
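Accuracy here is presumably the fraction of blogs whose predicted gender matches the labelled one. The sketch below shows that calculation on made-up labels; the `accuracy` helper and the example data are hypothetical and do not reproduce the project's evaluation.

```python
def accuracy(true_labels, predicted_labels):
    """Fraction of predictions that match the true labels."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    if not true_labels:
        raise ValueError("need at least one label")
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)


# Toy data for illustration only: 3 of the 4 predictions are correct.
truth = ["male", "female", "female", "male"]
predicted = ["male", "female", "male", "male"]
print(accuracy(truth, predicted))
```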

Future Improvements

Authors and Contributors

This tool was made by Nitish Jain (@nitishjain2007), Ganesh Borle (@buggynap), and Vamshikrishna Reddy (@vamshi177) from the International Institute of Information Technology, Hyderabad, as part of the Information Retrieval and Extraction course project.

Support or Contact

Having trouble with Pages? Check out our video or our presentation or raise an issue and we’ll help you sort it out.

Tags

Information Retrieval and Extraction Course, Gender Detection, Blogs, Classification, IIIT-H, Major Project, NLP, Machine Learning, Random Forest, Author Profiling