Getting Started to impact Machine Learning

Data Acquisition Process: What kind of datasets do different tasks require? The type of architecture we want to use determines the data acquisition strategy. We also need to think through whether the data carries any inherent bias. Most available datasets have been prepared and curated for particular tasks, so the question is how to make them useful for ours.

I am assigned to the translation part. My first thought was: will I have to scrape data from Wikipedia, thesauruses, dictionaries, and so on? I searched for translation APIs available online and found over 50 APIs providing translation services. Looking at the existing academic literature in this domain gave me an idea of how the landscape is evolving. I realized that enough datasets are already available for some Indian languages to cover our primary task. For example, a research group at IIT Bombay created a monolingual corpus for Marathi containing 27 million tokens. Jawaid has curated 95 million tokens of Urdu. 11 million tokens are available for Telugu in the wiki dump. Apart from that, there is pre-existing work with trained models and available weights that can be used directly for translation and transliteration tasks, or otherwise as a starting point for transfer learning.
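
When comparing corpora by size, as with the token counts quoted above, it helps to check the numbers yourself once a dataset is downloaded. Here is a minimal sketch in Python, assuming plain whitespace tokenization and one sentence per line. Both are my assumptions for illustration, and the cited corpora may have counted tokens differently. The function name corpus_stats is hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable


def corpus_stats(lines: Iterable[str]) -> Dict[str, int]:
    """Count whitespace-separated tokens and distinct token types.

    Assumption: tokens are separated by whitespace. Real corpora may
    need a language-specific tokenizer instead.
    """
    token_count = 0
    vocab = Counter()
    for line in lines:
        tokens = line.split()  # split on any run of whitespace
        token_count += len(tokens)
        vocab.update(tokens)
    return {"tokens": token_count, "types": len(vocab)}


if __name__ == "__main__":
    # Tiny in-memory sample; for a real corpus, pass an open file
    # object (opened with encoding="utf-8") so lines are streamed.
    sample = ["मी शाळेत जातो", "मी घरी जातो"]
    print(corpus_stats(sample))
```

Because the function only iterates over lines, it can be given an open file handle and will stream through a corpus of millions of tokens without loading all the text at once, although the vocabulary counter still grows with the number of distinct tokens.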

This post was about the thought process of getting started while thinking through to the end. The next post will cover the literature survey and links in detail, and I will try to make it a one-stop resource for Indic language translation and transliteration.
Disclaimer: I am solely liable for anything wrong in this post.

