Gender Detection in Blogs
Textual content has long been a prominent feature of internet media. Blog writing has become a widespread hobby among people of all ages, and popular outlets for social text include Twitter, Facebook, email, chat rooms, and blogs. With so much content comes a lot of responsibility, and since the internet cannot take responsibility for all of it, that responsibility falls on the author. Author profiling and attribution therefore become important tasks, and we try to capture one aspect of them: gender. This project aims to profile blogs by the gender of their author.
Dataset Credits
The dataset used is Koppel's Blog Dataset, containing 19320 blog instances.
Source - http://u.cs.biu.ac.il/~koppel/BlogCorpus.htm
Main System
The main system consists of:
- Feature Extraction from the blogs.
- Training different classifiers on the feature set.
- Building an ensemble model from the different classifiers.
A more detailed view of the system is shown in the flowchart below:
Feature Extraction
The following features were used for the classification task:
- Character Based Features
- Word Based Features
- Syntactic Features
- Structural Features
- Function Words
- POS Start Probability
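As a rough illustration of what such features can look like, here is a minimal sketch that computes a few character-based, word-based, syntactic (punctuation), structural, and function-word features for one blog post. The function name `extract_features`, the feature names, and the short `FUNCTION_WORDS` list are all assumptions made for this example; they are not the project's actual feature set. POS start probability is left out because it needs a part-of-speech tagger.

```python
import re
import string
from collections import Counter

# Hypothetical, abbreviated function-word list (assumption for illustration;
# the project's actual list is not given here).
FUNCTION_WORDS = ("a", "an", "and", "but", "i", "in", "of", "the", "to", "you")


def extract_features(text):
    """Return a dict of simple stylometric features for one blog post."""
    words = re.findall(r"[a-z']+", text.lower())
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    paragraphs = [p for p in text.split("\n\n") if p.strip()]
    counts = Counter(words)

    # Guard against division by zero on empty input.
    n_chars = max(len(text), 1)
    n_words = max(len(words), 1)

    features = {
        # Character-based features
        "char_count": len(text),
        "upper_ratio": sum(c.isupper() for c in text) / n_chars,
        "digit_ratio": sum(c.isdigit() for c in text) / n_chars,
        # Word-based features
        "word_count": len(words),
        "avg_word_len": sum(len(w) for w in words) / n_words,
        "type_token_ratio": len(counts) / n_words,
        # Syntactic feature: how heavily punctuation is used
        "punct_ratio": sum(c in string.punctuation for c in text) / n_chars,
        # Structural features
        "sentence_count": len(sentences),
        "paragraph_count": len(paragraphs),
    }
    # Function-word features: relative frequency of each listed word
    for fw in FUNCTION_WORDS:
        features["fw_" + fw] = counts[fw] / n_words
    return features


sample = "I love the rain. The rain is nice!\n\nDo you?"
sample_features = extract_features(sample)
```

Each post becomes one fixed-length feature dictionary, which can then be turned into a numeric vector for the classifiers.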
Classification
The following classifiers are used for the classification task:
- Random Forest Classifier (scikit-learn, Python)
- Multilayer Perceptron Neural Network Classifier (scikit-learn, Python)
- AdaBoost Tree Classifier (scikit-learn, Python)
- Gradient Boosting Classifier (scikit-learn, Python)
- Bagging Classifier (scikit-learn, Python)
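The sketch below shows one way the five scikit-learn classifiers could be instantiated and trained. It is an illustration only: the data is synthetic (generated with `make_classification`, not the blog dataset), and the helper name `build_classifiers` and every hyperparameter value are assumptions rather than the project's settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    AdaBoostClassifier,
    BaggingClassifier,
    GradientBoostingClassifier,
    RandomForestClassifier,
)
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier


def build_classifiers(seed=0):
    """Return the five classifiers, keyed by a short name.

    All hyperparameters here are illustrative assumptions.
    """
    return {
        "random_forest": RandomForestClassifier(n_estimators=50, random_state=seed),
        "mlp": MLPClassifier(hidden_layer_sizes=(32,), max_iter=500, random_state=seed),
        "adaboost": AdaBoostClassifier(n_estimators=50, random_state=seed),
        "gradient_boosting": GradientBoostingClassifier(random_state=seed),
        "bagging": BaggingClassifier(n_estimators=20, random_state=seed),
    }


# Synthetic stand-in for the extracted feature matrix and gender labels.
X, y = make_classification(n_samples=300, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

classifiers = build_classifiers()
scores = {}
for name, clf in classifiers.items():
    clf.fit(X_train, y_train)
    # score() returns mean accuracy on the held-out split
    scores[name] = clf.score(X_test, y_test)
```

The accuracies in `scores` describe the synthetic data only and say nothing about performance on real blogs.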
Predicting the output
To predict the output, we take an ensemble of the classifiers above by majority voting: each classifier predicts a label, and the most frequent label is the final prediction.
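Majority voting can be sketched in a few lines of plain Python. The function name `majority_vote` and the example labels below are hypothetical; with scikit-learn, `VotingClassifier(voting="hard")` offers the same kind of hard voting, though whether the project used it is not stated here.

```python
from collections import Counter


def majority_vote(predictions):
    """Combine per-classifier predictions into one label per sample.

    predictions: list of equal-length label sequences, one per classifier.
    """
    if not predictions:
        raise ValueError("need predictions from at least one classifier")
    n_samples = len(predictions[0])
    if any(len(p) != n_samples for p in predictions):
        raise ValueError("all classifiers must predict the same number of samples")

    voted = []
    # zip(*predictions) groups the votes that all classifiers cast for one sample
    for sample_votes in zip(*predictions):
        label, _count = Counter(sample_votes).most_common(1)[0]
        voted.append(label)
    return voted


# Hypothetical predictions from five classifiers on three blog posts.
votes = [
    ["male", "female", "female"],
    ["male", "male", "female"],
    ["female", "male", "female"],
    ["male", "female", "female"],
    ["female", "female", "male"],
]
final_labels = majority_vote(votes)
```

With five classifiers and two classes a tie cannot occur, which is one practical reason to use an odd number of voters.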
Results
The tool currently achieves an accuracy of around 73% on Koppel's Blog Dataset.
Future Improvements
- Extracting semantic and grammatical information from the text, such as the depth of dependency trees, the depth of noun phrases, adjectival phrases, etc.
- Using more advanced classifiers, such as recurrent neural networks, built with frameworks such as Google's TensorFlow.
- Entity/topic linking with the context of the blogs, e.g. a blog about cricket is more likely to be written by a male.
- Extending the tool to multiple languages.
Authors and Contributors
This tool was made by Nitish Jain (@nitishjain2007), Ganesh Borle (@buggynap), and Vamshikrishna Reddy (@vamshi177) from the International Institute of Information Technology, Hyderabad, as part of the Information Retrieval and Extraction course project.
Support or Contact
Having trouble with the tool? Check out our video or our presentation, or raise an issue and we’ll help you sort it out.
Tags
Information Retrieval and Extraction Course
Gender Detection
Blogs
Classification
IIIT-H
Major Project
NLP
Machine Learning
Random Forest
Author Profiling