Friday, September 30, 2011

Urdu Text Archive

Urdu Text Archive is a collection of Urdu Unicode text gathered from various online sources. The aim was to collect raw text as a first step towards creating an Urdu corpus. As no viable grammatical tagger is available for Urdu yet, the data can currently be used only for untagged work such as creating word lists or other computational linguistic purposes. The data is not well sampled because we had little choice in what could be collected.

The biggest source is news reports from various online Urdu newspapers. The data was retrieved and stripped of HTML programmatically, using regular expressions or HTML parsers in C#. The genres are typical of the news domain: articles, news reports, columns, letters to the editor, editorials, etc.

Another major genre is books. The books were collected from Ejaz Akhtar’s copyleft archive of open source Urdu books. They cover various fields, e.g. poetry, literature, religion, Quranic translations, Ahadith, and online editions of some magazines. The data from the books section is proofread and of good quality, and it can be used right away for further development.

The data from Urdu newspapers and other online resources is not as good. It is raw text with spelling errors and needs review. There are spacing problems (sometimes extra spaces are introduced, at other times spaces are omitted), and the spellings are non-standard in some cases. A full sweep for such errors is required before creating any further resource based on this data. In its present condition, the limited range of genres is also a problem. The scope of data collection needs to be expanded from newspapers to other fields as well.
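The HTML-stripping step mentioned above was done in C#, and that code is not reproduced here. As a rough illustration of the same idea, here is a minimal Python sketch using only the standard library's `html.parser`. The names `TextExtractor` and `html_to_text`, and the choice of which tags count as block-level, are assumptions made for this example, not the original tooling.

```python
from html.parser import HTMLParser
import re


class TextExtractor(HTMLParser):
    """Collects visible text and ignores <script> and <style> content."""

    SKIP = {"script", "style"}
    # Block-level tags become a space so neighbouring words do not fuse.
    # Inline tags such as <b> add nothing, so a word is never split.
    BLOCK = {"p", "br", "div", "li", "tr", "td", "th",
             "h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1
        elif tag in self.BLOCK:
            self.parts.append(" ")

    def handle_endtag(self, tag):
        if tag in self.SKIP:
            if self._skip_depth > 0:
                self._skip_depth -= 1
        elif tag in self.BLOCK:
            self.parts.append(" ")

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.parts.append(data)


def html_to_text(html):
    """Return the visible text of an HTML string with whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return re.sub(r"\s+", " ", "".join(parser.parts)).strip()
```

Real news pages also carry menus, adverts and footers, so a production scraper would additionally need per-site rules to pick out the article body; this sketch only removes the markup.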

Availability

The Urdu Text Archive data can be downloaded from this link on 4share.com. The sources used for this collection are Jang Online Edition, Nawaye Waq Online Edition, NewsUrdu.com news reports and articles, Karachiupdates.com, AalamiAkhbar Online Newspaper and the Urdu Books Archive by Ejaz Akhter. The archive is estimated to contain approximately 1,82,43,149 words (that is, 18,243,149, or about 18.2 million), and it will hopefully grow in future.
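The post does not say how the word estimate was computed. One plausible way to get a comparable figure is to split each line on whitespace and punctuation and count what remains. The following Python sketch does that; the function names and the punctuation set are assumptions for illustration only.

```python
import re

# Urdu full stop, Arabic comma, Arabic semicolon, Arabic question mark,
# plus common ASCII punctuation. This set is an assumption, not exhaustive.
PUNCT = "\u06D4\u060C\u061B\u061F.,;:!?\"'()[]{}-"
SPLIT = re.compile("[" + re.escape(PUNCT) + r"\s]+")


def tokenize(text):
    """Split text on whitespace and punctuation, dropping empty pieces."""
    return [tok for tok in SPLIT.split(text) if tok]


def count_words(lines):
    """Total number of tokens over an iterable of text lines."""
    return sum(len(tokenize(line)) for line in lines)
```

Because it accepts any iterable of lines, `count_words` can be fed an open file object directly, so a large archive never has to be loaded into memory at once. Note that the spacing errors described earlier will skew such a count: a missing space merges two words into one token, and a stray space splits one word into two.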

Future Plans

  • Expanding the archive, e.g. Jasarat Online is a good source.
  • Adding more genres, e.g. Urdu chat data, Urdu forum discussions (after anonymization) and Urdu blogs.
  • Developing Urdu resources based on this data, e.g. word lists for spell checking.
  • Refining the data and eliminating formatting, language and spelling problems.
  • Finding trends among Urdu writers on the Internet.
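Two of the plans above, cleaning up spacing and building word lists for spell checking, can be sketched together. The Python below is only a starting point under stated assumptions: `clean_line` and `build_word_list` are hypothetical names, and the `min_count` threshold is an arbitrary choice for filtering out likely misspellings.

```python
import re
import unicodedata
from collections import Counter

URDU_PUNCT = "\u06D4\u060C\u061B\u061F"  # Urdu full stop, comma, semicolon, question mark
_SPACE_RUN = re.compile(r"[ \t\u00A0]+")
_SPACE_BEFORE_PUNCT = re.compile(" +([" + URDU_PUNCT + "])")


def clean_line(line):
    """Apply NFC, collapse runs of spaces, drop spaces before Urdu punctuation."""
    line = unicodedata.normalize("NFC", line)
    line = _SPACE_RUN.sub(" ", line)
    line = _SPACE_BEFORE_PUNCT.sub(r"\1", line)
    return line.strip()


def build_word_list(lines, min_count=2):
    """Return (word, count) pairs seen at least min_count times, most frequent first."""
    counts = Counter()
    for line in lines:
        for token in clean_line(line).split():
            word = token.strip(URDU_PUNCT + ".,;:!?\"'()")
            if word:
                counts[word] += 1
    return [(w, c) for w, c in counts.most_common() if c >= min_count]
```

This handles only the extra-space case. Omitted spaces need a dictionary or a statistical segmenter to repair, and NFC does not unify look-alike letters typed from different keyboard layouts, so both would need separate treatment.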

3 comments:

Amir kamran said...

This is a nice effort, but can you tell me about the copyright of the data you shared? If I use this data for research, can I publish it with the results?

Muhammad Shakir Aziz said...

Thanks for the comment, Amir. This is an individual effort with no financial support from any organization, government or person. Additionally, the sources from which the data was collected would most likely not be willing to grant permission without a handsome payment. So it will have to be used without explicit reference to any particular source. At the moment, this was what could be done. In future, an attempt to get permission might be made for a proper, smaller corpus.

Shakir said...

thank you man ... i am building an opensource roman urdu - urdu transliteration software ... this will help me a lot