
Friday, August 14, 2020

Resources for Forensic Linguistics

Forensic linguistics is a branch of linguistics concerned (mostly) with criminal investigations and the use of linguistic analysis in that process. For example, a very simple task for a forensic linguist could be to analyse the linguistic content of a threat message and try to identify its writer. A linguist trained in text analysis and stylistics is equipped with tools that are helpful in such an analysis. Today, corpus linguistics is used heavily in text analysis for forensic linguistics. Moreover, forensic linguistics is not just about investigating threat messages or suicide notes; the new mediums of communication on the internet offer many more opportunities, e.g. the analysis of social media texts for deception, fake news, disinformation, and trolling.
I have come to know some people and resources in this area over the last few years. Following them on social media (Twitter) will help the reader widen their horizons and/or find more information about forensic linguistic methods.
As far as books are concerned, a simple Google search for 'forensic linguistics' returns many titles, including introductory books and handbooks. PDFs of many of these books can be found on Library Genesis (search on Google, or use the DuckDuckGo search engine, which is a bit more merciful towards such websites).

Thursday, August 30, 2018

Understanding Multidimensional Analysis

Multidimensional Analysis is a quantitative framework for studying register variation developed by Douglas Biber. These slides present a very basic introduction to this methodology and its assumptions.
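For readers who just want the gist of the computation, here is a minimal sketch (not taken from the slides) of how a dimension score is typically derived in Biber's framework: each feature's rate per 1,000 words is standardised against the corpus mean and standard deviation, and the z-scores of the features loading on a dimension are summed, with negatively loading features subtracted. All feature names and figures below are invented for illustration.

using System;

class DimensionScoreSketch
{
    // Standardise a feature's rate per 1,000 words against its corpus mean and SD.
    static double ZScore(double rate, double mean, double sd)
    {
        return (rate - mean) / sd;
    }

    static void Main()
    {
        // Invented rates per 1,000 words for one text.
        double privateVerbs = 21.0, firstPersonPronouns = 35.0, nouns = 180.0;

        // Invented corpus means and standard deviations for the same features.
        double zPrivate = ZScore(privateVerbs, 18.0, 7.0);
        double zPronouns = ZScore(firstPersonPronouns, 27.0, 12.0);
        double zNouns = ZScore(nouns, 200.0, 35.0);

        // A Dimension-1-style score: "involved" features add, "informational"
        // features (here, nouns) subtract, following the sign of their loadings.
        double dimensionScore = zPrivate + zPronouns - zNouns;
        Console.WriteLine("Dimension score: " + dimensionScore);
    }
}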


Friday, September 30, 2011

Urdu Text Archive

Urdu Text Archive is a collection of texts from different online sources of Urdu Unicode data. The aim was to collect raw text as a first step towards creating an Urdu corpus. As there is no viable grammatical tagger for Urdu available yet, this data can only be used to create word lists or for computational linguistic purposes. The data is not sampled very well, because we had little choice in data collection. The biggest source is news reports from various online Urdu newspapers. The data was retrieved and freed from HTML programmatically, using regexes or HTML parsers in C#. The genres are typical of the news domain: articles, news reports, columns, letters to the editor, editorials, etc. Another major genre is books, collected from Ejaz Akhtar's copyleft archive of open-source Urdu books. The books cover various fields, e.g. poetry, literature, religion, Quranic translations, Ahadith, and online editions of some magazines.
The data from the books section is proofread and of good quality; it can be used right away for any further development. The data from Urdu newspapers and other online resources, however, is not so good. It is raw text with spelling errors and needs review: sometimes additional spaces are introduced, sometimes spaces are omitted, and in some cases the spellings are non-standard. So a full sweep of such errors is required before building any further resource on this data. In its present condition, the limited range of genres is also a problem; the scope of data collection needs to be expanded from newspapers to other fields as well.
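As an illustration of the HTML clean-up step mentioned above (not the actual code used for the archive; the file names are hypothetical), a minimal regex-based sketch looks like this:

using System;
using System.IO;
using System.Text.RegularExpressions;

class HtmlCleanerSketch
{
    static void Main()
    {
        // Hypothetical input: a saved news page in UTF-8.
        string html = File.ReadAllText("page.html");

        // Drop script and style blocks entirely, then remove the remaining tags.
        string text = Regex.Replace(html, @"<(script|style)[^>]*>.*?</\1>", " ",
                                    RegexOptions.Singleline | RegexOptions.IgnoreCase);
        text = Regex.Replace(text, @"<[^>]+>", " ");

        // Collapse runs of whitespace so the raw text becomes one clean stream.
        text = Regex.Replace(text, @"\s+", " ").Trim();

        File.WriteAllText("page.txt", text);
        Console.WriteLine("Extracted " + text.Length + " characters.");
    }
}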

Availability

The data of Urdu Text Archive can be downloaded from this link on 4share.com. The sources used for this collection are Jang Online Edition, Nawaye Waqt Online Edition, NewsUrdu.com news reports and articles, Karachiupdates.com, AalamiAkhbar Online Newspaper, and the Urdu Books Archive by Ejaz Akhter. The archive contains approximately 1,82,43,149 (about 18.2 million) words, and it will hopefully grow in the future.

Future Plans

  • Expanding the archive, e.g. Jasarat Online is a good source.
  • Adding more genres, e.g. Urdu chat data, Urdu forum discussions (after anonymization), and Urdu blogs.
  • Developing Urdu resources based on this data, e.g. spell-checking word lists for Urdu.
  • Refining the data and eliminating formatting, language, and spelling problems.
  • Finding trends of Urdu writers on the internet.

Thursday, September 22, 2011

Creating an Urdu Translation Glossary with C#

After a long time I have come back to C#. My limited skills only let me do certain things with text processing; being self-taught, my silly-looking little C# scripts deal with file operations, word frequencies, getting counts from different text files, and so on. I use C# the way Python or another scripting language should be used, to write small programs for tiny tasks, so I consider myself less of a programmer. Still, I am somewhere between a layman and a programmer, and can write a few bits of code. This time, after several months away from C#, it was the urge to create an Urdu glossary that works with OmegaT. OmegaT is an open-source translation memory tool, and I use it for most of my translations. I also need a dictionary constantly. The one I use, Urdu Web Lughat, is written in C# 2 and fits my purpose very well once I replace the default file (which has only a few thousand words) with this Lughat File. That file has 92,661 words and expressions, which usually give me insight into the correct translation; several other Urdu lughats were merged with the original few-thousand-word file to create it. It serves the purpose, but it creates problems as well. I have to type each word in its lemma form, i.e. if it is a verb in the third form, e.g. got, I have to type the first form get to get the meaning. The other problems are with the application itself.
[Screenshot: Urdu Web Lughat]
As the picture shows, the capability to search for the desired word on Google, Wiktionary, and Urdu Wikipedia has been added to it. This is done through the default Internet Explorer engine provided with C# 2005, but it causes problems on newer versions of Windows, e.g. Windows 7. Each time I look up a word, it shows two or more dialogue boxes with this error.
[Screenshot: Urdu Web Lughat error dialog]
I think I have the source code and could fix the problem by simply disabling this feature. But my other problem remains: typing the word every time takes too much time. So I was looking for a solution. The solution I figured out was simple: put the lughat file into OmegaT somehow. That is possible through the program's dictionary or glossary setup. I find OmegaT's dictionary process quite complicated (it requires the StarDict format), so I thought it more feasible to add a glossary file instead (which uses the format word-tab-meaning). This was easy for me. I wrote a regular expression to match the lughat file's XML and extract each word-meaning node (don't tell me it would have been easier with an XML parser; I do not know one, so regex was best for me). Here is the code I used to convert the XML nodes into a tab-separated text file with each word-meaning pair on a new line.
public static void test()
{
    // Matches each <I><w>word</w><m>meaning</m></I> node in the lughat XML.
    string match = "\\<I\\>\\s*\\<w\\>(?<Word>[^><]+)\\</w\\>\\s*\\<m\\>(?<Meaning>[^><]+)\\</m\\>\\s*\\</I\\>";
    string text = File.ReadAllText(@"G:\Software\Dictionary\dictionary.xml");
    Console.WriteLine(text.Length);
    MatchCollection m = Regex.Matches(text, match);
    StreamWriter sw = new StreamWriter(@"G:\Software\Dictionary\Glossary.txt", true);
    string word = "";
    string meaning = "";
    Console.WriteLine(m.Count);
    for (int i = 0; i < m.Count; i++)
    {
        // Flatten any line breaks inside an entry so each pair stays on one line.
        word = Regex.Replace(m[i].Groups["Word"].Value, "[\\r\\n]+", " ");
        meaning = Regex.Replace(m[i].Groups["Meaning"].Value, "[\\r\\n]+", " ");
        // OmegaT glossary format: word <tab> meaning, one pair per line.
        sw.WriteLine(word + "\t" + meaning);
        Console.WriteLine(word);
    }
    sw.Close();
}
I know it could be improved a lot, but it did the work, and did it pretty quickly. I put the glossary file in the OmegaT project folder where it is required (/project/glossary/Glossary.txt), and that was done.
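As an aside for anyone who does know an XML parser: the same extraction could look roughly like the sketch below using System.Xml. It assumes the same <I><w>…</w><m>…</m></I> node structure that the regex above matches, and I have not run it against the actual lughat file.

// Sketch of the same extraction with System.Xml instead of regex.
public static void ConvertWithXmlParser()
{
    System.Xml.XmlDocument doc = new System.Xml.XmlDocument();
    doc.Load(@"G:\Software\Dictionary\dictionary.xml");

    using (StreamWriter sw = new StreamWriter(@"G:\Software\Dictionary\Glossary.txt", true))
    {
        foreach (System.Xml.XmlNode node in doc.SelectNodes("//I"))
        {
            System.Xml.XmlNode w = node.SelectSingleNode("w");
            System.Xml.XmlNode m = node.SelectSingleNode("m");
            if (w == null || m == null) continue;   // skip malformed nodes

            // Flatten line breaks inside entries, exactly as the regex version does.
            string word = Regex.Replace(w.InnerText, "[\\r\\n]+", " ").Trim();
            string meaning = Regex.Replace(m.InnerText, "[\\r\\n]+", " ").Trim();
            sw.WriteLine(word + "\t" + meaning);
        }
    }
}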
[Screenshot: OmegaT with Glossary 1]
I was able to see the meaning of each word of the current sentence in the glossary pane at the lower right. It was quite a help and saved me time. But then came the other problem: I still had to go to the dictionary and type the lemma forms of verbs and plural nouns, so there was no meaning available for worked, working, or works, only for the first form, work. This was quite frustrating, because the actual problem was still there. So I decided to add the other forms of words that have more than one form: new glossary lines for each form of a verb, and lines for plural forms of nouns as well. The idea was simple enough, but to accomplish it I had to get hold of a lemma list. Fortunately, being a corpus linguistics student, I knew of one on the internet. E_Lemma.txt was created for WordSmith Tools, but I used it for my purpose. The code I used to find the lemma forms of each word and add them to the new glossary file is below.
public static void Main(string[] args)
{
    string[] lines = File.ReadAllLines(@"G:\Software\Dictionary\Glossary.txt");
    string[] lemmas = File.ReadAllLines(@"F:\Corpus Related\e_lemma.txt");
    string word = "";
    string meaning = "";
    string lemma = "";
    StreamWriter sw = new StreamWriter(@"G:\Software\Dictionary\Glossary2.txt", true);
    int count = 1;
    foreach (string line in lines)
    {
        Console.WriteLine(count + " of " + lines.Length);
        word = Regex.Split(line, "\t")[0].Trim();
        meaning = Regex.Split(line, "\t")[1].Trim();
        // Get all inflected forms of the word, if it is listed in e_lemma.txt.
        lemma = giveLemma(word, lemmas);
        if (lemma != "")
        {
            // Write one glossary line per form, all sharing the same meaning.
            foreach (string lemma1 in lemma.Split(' '))
            {
                sw.WriteLine(lemma1 + "\t" + meaning);
                Console.WriteLine(lemma1 + "\t" + meaning);
            }
        }
        else
        {
            // No lemma entry: keep the original word-meaning pair as it was.
            sw.WriteLine(word + "\t" + meaning);
            Console.WriteLine(word + "\t" + meaning);
        }
        count++;
    }
    sw.Close();
}

// Looks the word up in the e_lemma.txt list ("lemma -> form1,form2,...") and returns
// the lemma plus all its forms as a space-separated string, or "" if it is not listed.
public static string giveLemma(string word, string[] lemmas)
{
    string toReturn = "";
    string lemma1 = "";
    string lemma2 = "";
    foreach (string lemma in lemmas)
    {
        lemma1 = Regex.Split(lemma, "->")[0].Trim();
        lemma2 = Regex.Split(lemma, "->")[1].Trim();
        if (word == lemma1)
        {
            toReturn += lemma1 + " " + Regex.Replace(lemma2, ",", " ");
        }
    }
    return toReturn;
}
As you can see, the task is pretty simple.
  • Get each line from the previously made glossary file.
  • Split it into word and meaning.
  • Pass the word to another function, along with the lemma list (taken from the e_lemma.txt file), which returns all possible forms of a verb or noun.
  • At the end, write a new "word-tab-meaning" line for each lemma form of the verb or noun.
  • If there is no lemma entry, simply write the original word-meaning pair.
The task was simple but the code was quite inefficient, so it took a long time: almost 1.5 hours. But it was done and worked like a charm.
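One obvious way it could be made faster (a sketch only, not something I have benchmarked; it assumes the same lemma -> form1,form2 line format of e_lemma.txt) is to load the lemma list into a Dictionary once, so that each glossary word becomes a single lookup instead of a scan through the whole array:

// Sketch: build a lemma -> "lemma form1 form2 ..." lookup table once.
public static Dictionary<string, string> BuildLemmaTable(string[] lemmas)
{
    Dictionary<string, string> table = new Dictionary<string, string>();
    foreach (string lemma in lemmas)
    {
        string[] parts = Regex.Split(lemma, "->");
        if (parts.Length < 2) continue;   // skip malformed lines
        string head = parts[0].Trim();
        string forms = Regex.Replace(parts[1].Trim(), ",", " ");
        table[head] = head + " " + forms;
    }
    return table;
}

With a table like this, giveLemma(word, lemmas) reduces to a single ContainsKey check per glossary entry.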
[Screenshot: OmegaT with Glossary 2]
So a glossary for Urdu translators working with OmegaT is now available. And of course it can be downloaded from here.

Saturday, May 29, 2010

Regular Expressions: Facility, Ability and Haste

Regular expressions, or regexes, are a powerful tool for text processing. People like me who deal with text processing on a daily basis know the ease and power they provide. Regexes are actually a full-fledged mini-language, with their own rules and a very systematic, organized structure. Regexes as they are known today derive mostly from the early days of Perl, which is why they are often called Perl Compatible Regular Expressions. No high-level language today can afford to miss this much-demanded feature, and even C++, the parent of many newer languages, will get a regex library in its new C++0x standard.
I came to know about regexes three years ago when I was working with my teachers on their corpus research. Initially I was unable to grasp what regexes meant and the power behind them. But after some time I got books and articles on the topic and started learning them. The book most helpful for me was Friedl's Mastering Regular Expressions, 3rd edition (O'Reilly, 2006). I completed only two chapters of that book, but it turned me from a lame lamb into a speedy panther at text processing. Being a linguistics student and a corpus linguist, I am always seeking ways to extract text patterns automatically in the least possible time, and regexes give me that facility. Along with regexes I use C# 2005, which gives me the power to do everything I want with texts.

Regexes are good, but they are like a knife in your hand that can also cut the hand that holds it. You should be well aware of the pros and cons of using them. The very first thing to do as a corpus linguist is to look for regularities in the text. These regularities or patterns will help you find the right regex for the purpose. The best strategy is to analyse the data manually, e.g. by inspecting concordance lines in search of the required constructions. Once you have inspected the data and found the ways the construction occurs, you can create a good regex. But remember to cross-check, double-check, and recheck your regular expression: is it catching the maximum? Is the loss minimal? And finally, is it affordable? By affordable I mean whether it is worth hastily adding every construction and thus increasing your work. Regexes give power and flexibility, but they should be used carefully, constructed with great care, and verified against manual analysis. And most importantly, use regexes to pull out concordance lines that you then inspect manually; that way you reduce your work while quality is maintained.
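To make that last point concrete, here is a small, untested sketch of pulling concordance lines with a regex so they can be inspected by hand; the corpus file name and the pattern (forms of 'take' plus the following word) are invented for the example.

using System;
using System.IO;
using System.Text.RegularExpressions;

class ConcordanceSketch
{
    static void Main()
    {
        // Hypothetical corpus file and an illustrative pattern.
        string text = File.ReadAllText("corpus.txt");
        Regex pattern = new Regex(@"\b(take|takes|took|taken|taking)\s+\w+", RegexOptions.IgnoreCase);

        foreach (Match m in pattern.Matches(text))
        {
            // Print each hit with 40 characters of context on either side,
            // so the lines can be checked manually.
            int start = Math.Max(0, m.Index - 40);
            int end = Math.Min(text.Length, m.Index + m.Length + 40);
            string line = text.Substring(start, end - start).Replace('\n', ' ').Replace('\r', ' ');
            Console.WriteLine(line);
        }
    }
}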

Wednesday, May 12, 2010

Sorting a Dictionary in C# 2

During corpus processing tasks, I need to create and sort dictionaries for the very simple task of frequency list generation. For this I always have to search the web for solutions. C# is now at version 4 and has many variations and innovations in its syntax: additional typing conventions, keywords, namespaces like the powerful System.Linq, and extension methods. But being a linguistics student, I cannot find the time to upgrade my skills from C# 2005, a.k.a. C# 2, to the latest versions. Perhaps it is not necessary either, because I need C# for text processing tasks, which are done very easily with C# 2 and its System.Collections.Generic namespace.
So here is a code sample that can be used to sort a dictionary by its values. I have blended two or three methods into one so that it takes a dictionary as input and gives a dictionary as output. The code may be inefficient due to my limitations as a programmer, but it works for me. Hopefully it will work for you as well. :-)
// Sorts a frequency dictionary (word -> count) by value, in descending order.
public static Dictionary<string, int> Sort(Dictionary<string, int> dict)
{
    List<KeyValuePair<string, int>> list = new List<KeyValuePair<string, int>>();
    foreach (KeyValuePair<string, int> kvp in dict)
    {
        list.Add(kvp);
    }
    list.Sort(
        delegate(KeyValuePair<string, int> firstPair,
                 KeyValuePair<string, int> nextPair)
        {
            return nextPair.Value.CompareTo(firstPair.Value);
        }
    );
    Dictionary<string, int> d = new Dictionary<string, int>();
    foreach (KeyValuePair<string, int> kvp in list)
    {
        d.Add(kvp.Key, kvp.Value);
    }
    return d;
}
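As a usage sketch, the method above could be dropped into a small frequency-list run like the one below; the input file name and the tokenising regex are made up for the example.

// Usage sketch: build a word frequency list from a text file and sort it by frequency.
public static void FrequencyListDemo()
{
    string text = File.ReadAllText("corpus.txt");          // hypothetical input file
    Dictionary<string, int> freq = new Dictionary<string, int>();

    // Count every lower-cased word-like token.
    foreach (Match m in Regex.Matches(text.ToLower(), @"\w+"))
    {
        if (freq.ContainsKey(m.Value))
            freq[m.Value]++;
        else
            freq[m.Value] = 1;
    }

    // Sort by frequency and print the list, most frequent first.
    Dictionary<string, int> sorted = Sort(freq);
    foreach (KeyValuePair<string, int> kvp in sorted)
        Console.WriteLine(kvp.Key + "\t" + kvp.Value);
}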