Hi there,

In this series of blogs, We will go throw the development process of The Arabic extension of Textblob.

This series will focus on the mathematical and scientifical background. In addition to explaining some of the models and algorithms used to build up the package.

Every blog in this theory will either explain the mathematics of a set of algorithm(s) and model(s) or development oriented and focused on python and Textblob package.

What’s this all about?

These series are meant to the project supervisors, development team, new contributors and anyone interested in learning about Arabic natural language processing.

Textblob

Textblob as described by its maintainers;

A Python (2 and 3) library for processing textual data. It provides a simple API for diving into common natural language processing (NLP) tasks such as part-of-speech tagging, noun phrase extraction, sentiment analysis, classification, translation, and more.

Until the time of writing this blog, Textblob has around 6k stars, 772 forks and 20 contributors.

It’s featured in a lot of articles as one of the best 5 python packages in NLP.

Why Textblob and not a separate project?

The aim of building this project is to have a trusted tool for arabic NLP tasks. And the trade off between starting a new project from scratch and building the tool as an extension for a well-known tool was big.

Also, I didn’t want to Rebuild the wheel and Textblob gives a good infrastructure for adding new algorithms and models.

Another point that the project is still in its early days and It’s better to focus on building the real models and algorithms then think about the name and the structure -which can change easily-.


Arabic NLP

The gap between importing and using an English model to do a simple NLP task and an Arabic one is very wide.

For example, if you’re trying to build an English chatbot. You will find a package for every task included. And for the same task you will find multiple tools that you can choose between them although they are all good.

While if you want to build the same model for Arabic. You will barely find good tools. Despite you will find a lot of papers.

So the goal of this tool is very defined; we want to build a universal tool that helps Engineers in building Arabic-supported applications.


The first version

For my 2019 graduation project, the goal is to deliver the first version of the project which includes the most important algorithms and models. And to provide the core infrastructure to build more complex models.

Including but not limited to:

  • Tokenization.
  • Stemming.
  • Part of speech tagging.
  • Noun-phrase extraction.
  • Sentiment analysis.
  • Spelling correction.
  • Text similarity.

Philosophy

The philosophy of the project is to be minimal, opinionated and straightforward.

The project might be part of Textblob but really inspired by SpaCy and it’s quality and speed.


This is part I of the series GP Docs