How many unique words are there in the Quran?
I put this question to Kais Dukes, author of the Quranic Arabic Corpus. My email read as follows:
> Dear Brother,
>
> السلام عليكم
>
> I have gone through your website and found it very essential for
> learners, researchers and for curious Muslims.
>
> I have a question for you:
> How many words are there in the Holy Quran without repetition? In other
> words, how many unique words are there in the Quran?
>
> I hope you have the answer. If your answer is from a secondary source,
> please cite the relevant sources.
>
> مع أطيب التمنيات
>
> Md. Fazlul Haque
In response to my question, he wrote:

Salamu Alaykum Fazlul Haque,

To the best of my knowledge, our project is the first accurate annotated morphological work for the Quran by computer, so I would be surprised at an accurate unique word count from another secondary source. Although of course, I could be wrong.

The number of unique Arabic words in the Quran is not an easy question to answer. In Arabic the concept of a "word" can have multiple technical linguistic interpretations. Based on the existing annotation we have performed at the Quranic Arabic Corpus (http://corpus.quran.com), I can provide the following statistics:

Total number of space-separated words = 77,430
Number of *unique* surface forms (i.e. space-separated word-forms, including clitics) = 18,994
Number of unique words by *stem* = 12,183
Number of unique words by *root* = 1,685 (not necessarily a great metric for unique word counting, e.g. pronouns have no Semitic root)
Number of unique words by *lemma* = 3,382 (excluding verbs, and other words where lemma is not annotated)

This is a primary source (we annotated this ourselves). These figures are quite accurate, but are subject to minor revision as further checking occurs.

The terms used above have technical linguistic meanings. Thus, the number of unique "words" is not only a problem of counting. We have computers, so counting annotated data is in theory very simple; I produced the above statistics after 10 minutes of work just now. The issue is what metric to use ... unique white-space separated word-forms, stems, roots, lemmas, or something else? Unlike English, Arabic is a highly inflected and morphologically rich language, with multiple segments often fused into a single word-form.

As an estimate, I would say that there are at most 7,000 unique "words" in the Quran in the sense of what you would need to have a lexicon with wide-ranging coverage for the Quran.

Something else interesting to note is the Zipfian distribution. A handful of words (e.g. the top 100 words) will cover a very large percentage of the actual Quran, i.e. most verses (the 80/20 rule).

You might be interested in these web pages:

http://corpus.quran.com/lemmas.jsp - List of unique lemmas in the Quran organized by frequency
http://corpus.quran.com/verbs.jsp - List of unique verbs in the Quran organized by frequency

Sorry for giving you such a vague linguist's response, but in Arabic the concept of a unique word is itself vague, and Arabic linguists (or at least computational Arabic linguists) tend to prefer to work with better defined terms such as the white-space separated tokens, surface form, lemma, stem and root, but even then those terms also have problems :-)

I would suggest that the above two web pages with lists of most frequently occurring lemmas and verb roots are probably more what you are looking for.

If you have any further questions, please ask, I would be happy to help.

--
Kais Dukes
Language Research Group
School of Computing
University of Leeds
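
To make the difference between these metrics concrete, here is a minimal Python sketch. It is not the Corpus's own tooling. It counts whitespace-separated tokens, unique surface forms, unique lemmas, and the share of a text covered by its k most frequent forms (the Zipfian effect mentioned in the reply). The function names, the toy English sentence and the toy lemma table are all hypothetical and chosen only for illustration; real figures for the Quran depend on the morphological annotation of the Quranic Arabic Corpus, because Arabic word-forms often fuse several segments that a plain split cannot separate.

```python
from collections import Counter


def surface_stats(text):
    """Return (total tokens, unique surface forms) for whitespace-separated text."""
    tokens = text.split()
    return len(tokens), len(set(tokens))


def unique_by(tokens, annotation):
    """Count distinct annotation values (e.g. lemmas); unannotated tokens are skipped."""
    return len({annotation[t] for t in tokens if t in annotation})


def top_k_coverage(tokens, k):
    """Fraction of all tokens accounted for by the k most frequent surface forms."""
    if not tokens:
        return 0.0
    counts = Counter(tokens)
    covered = sum(count for _, count in counts.most_common(k))
    return covered / len(tokens)


# Toy data (hypothetical): English stands in for annotated Arabic text.
text = "the book and the books of the reader and the readers"
toy_lemmas = {
    "the": "the", "and": "and", "of": "of",
    "book": "book", "books": "book",
    "reader": "reader", "readers": "reader",
}

tokens = text.split()
total, unique_forms = surface_stats(text)
print("tokens:", total)
print("unique surface forms:", unique_forms)
print("unique lemmas:", unique_by(tokens, toy_lemmas))
print("coverage of top 2 forms:", round(top_k_coverage(tokens, 2), 2))
```

Even in this toy sentence the answer changes with the metric: 11 tokens, 7 unique surface forms, but only 5 lemmas, and the two most frequent forms alone cover more than half of the text.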