Tuesday, August 13, 2013

Number of Unique Words in the Quran.

How many unique words are there in the Quran?

I put this question to Kais Dukes, author of the Quranic Arabic Corpus. My email read:
Dear Brother,
>
> السلام عليكم
>
> I have gone through your website and found it very essential for
> learners, researchers and for curious Muslims.
>
> I have a question for you.
> How many words are there in the Holy Quran without repetition? In other
> words, how many unique words are there in the Quran?
>
> I hope you have the answer. If your answer is from a secondary source,
> please cite the relevant sources.
>
> مع أطيب التمنيات
>
> Md. Fazlul Haque
 
 In response to my question, he wrote:

Salamu Alaykum Fazlul Haque,

To the best of my knowledge, our project is the first accurate
annotated morphological work for the Quran by computer, so I would be
surprised at an accurate unique word count from a secondary
source, although of course I could be wrong. The number of unique
Arabic words in the Quran is not an easy question to answer. In Arabic
the concept of a "word" can have multiple technical linguistic
interpretations. Based on the existing annotation we have performed at
the Quranic Arabic Corpus (http://corpus.quran.com), I can provide the
following statistics:
Total number of space-separated words = 77,430
Number of *unique* surface forms (i.e. space-separated word-forms,
including clitics) = 18,994
Number of unique words by *stem* = 12,183
Number of unique words by *root* = 1,685 (not necessarily a great
metric for unique word counting, e.g. pronouns have no Semitic root)
Number of unique words by *lemma* = 3,382 (excluding verbs, and other
words where lemma is not annotated).

This is a primary source (we annotated this ourselves). These figures
are quite accurate, but are subject to minor revision as further
checking occurs. The terms used above have technical linguistic
meanings. Thus, the number of unique "words" is not only a problem of
counting. We have computers, so counting annotated data is in theory
very simple; I produced the above statistics after 10 minutes of work
just now. The issue is what metric to use ... unique white-space
separated word-forms, stems, roots, lemmas, or something else? Unlike
English, Arabic is a highly inflected and morphologically rich
language, with multiple segments often fused into a single word-form.

As an estimate, I would say that there are at most 7,000 unique
"words" in the Quran in the sense of what you would need to have a
lexicon with wide-ranging coverage for the Quran. Also interesting
to note is the Zipfian distribution: a handful of words (e.g. the
top 100 words) will cover a very large percentage of the actual
Quran, i.e. most verses (the 80/20 rule).

You might be interested in these web pages:

http://corpus.quran.com/lemmas.jsp - List of unique lemmas in the
Quran organized by frequency
http://corpus.quran.com/verbs.jsp - List of unique verbs in the Quran
organized by frequency

Sorry for giving you such a vague linguist's response, but in Arabic
the concept of a unique word is itself vague, and Arabic linguists (or
at least computational Arabic linguists) tend to prefer to work with
better defined terms such as the white-space separated tokens, surface
form, lemma, stem and root, but even then those terms also have
problems :-)

I would suggest that the above two web pages, with lists of the most
frequently occurring lemmas and verb roots, are probably more what you
are looking for.

If you have any further questions, please ask, I would be happy to help.

-- Kais Dukes

Language Research Group
School of Computing
University of Leeds
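
To make the difference between these metrics concrete, here is a minimal Python sketch. It is not the method used by the Quranic Arabic Corpus. The token rows are a hand-made toy sample in rough transliteration (not corpus data), and the names TOKENS, REPORTED, unique_counts and coverage are my own assumptions for illustration. It counts unique values per annotation layer, and it measures how much of the text the top-N items cover, which is the idea behind the Zipfian remark in the reply.

```python
from collections import Counter

# Toy annotation, one row per space-separated token:
# (surface form, stem, lemma, root). Hand-made for illustration only;
# a root of None stands for a word with no Semitic root (e.g. a pronoun).
TOKENS = [
    ("wal-kitabu", "kitabu", "kitab", "ktb"),
    ("al-kitabi", "kitabi", "kitab", "ktb"),
    ("kataba", "kataba", "kataba", "ktb"),
    ("yaktubu", "yaktubu", "kataba", "ktb"),
    ("huwa", "huwa", "huwa", None),
    ("wal-kitabu", "kitabu", "kitab", "ktb"),
    ("huwa", "huwa", "huwa", None),
    ("al-kitabu", "kitabu", "kitab", "ktb"),
]

# Figures quoted in the reply above.
REPORTED = {"total_tokens": 77430, "unique_surface_forms": 18994}

FIELDS = ("surface", "stem", "lemma", "root")


def unique_counts(tokens):
    """Count distinct values in each annotation layer, skipping None."""
    counts = {}
    for index, name in enumerate(FIELDS):
        counts[name] = len({row[index] for row in tokens if row[index] is not None})
    return counts


def coverage(tokens, field, top_n):
    """Fraction of annotated tokens covered by the top_n most frequent values."""
    index = FIELDS.index(field)
    freq = Counter(row[index] for row in tokens if row[index] is not None)
    total = sum(freq.values())
    covered = sum(count for _, count in freq.most_common(top_n))
    return covered / total


if __name__ == "__main__":
    print(unique_counts(TOKENS))
    print(coverage(TOKENS, "lemma", 1))
    # Share of unique surface forms among all tokens, from the reported figures.
    print(REPORTED["unique_surface_forms"] / REPORTED["total_tokens"])
```

On the toy sample the same eight tokens give 6 unique surface forms, 5 stems, 3 lemmas and 1 root, so the answer to "how many unique words" depends entirely on which layer is counted. With the figures from the reply, unique surface forms are roughly a quarter of all space-separated tokens.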


18 comments:

  1. Thank you for your post, brother. Where can I find a list of the most commonly used words that form 80% of the Quran?

  2. You can download the 80% words of the Quran from this link: http://mebk12.meb.gov.tr/meb_iys_dosyalar/41/02/174154/dosyalar/2014_01/02125602_denkelimekartlarngarap..pdf

    Replies
    1. Assalamu alaikum Fazlul Haque,
      I followed the link you posted here and found the PDF file with the 80% words of the Quran. I would like to use the information in it in a work that I am doing. Do you know if it is copyrighted, or how I can contact the owner?
      Thank you. JazakaAllahu Khair

    2. 🕌 “The Essential Book of Quranic Words” by Abrar Khan
      http://quranicwords.com
  3. Very useful, the answer is exactly the one I am looking for!!!!

  4. What is the meaning of the word wenhar in sura 108/2?

  5. This comment has been removed by the author.

  6. Thank you for your post, brother. Where can I find a list of 2000 words not repeated in the Quran?

  7. Actually, that would be based on Quranic verb ROOTS plus words without ROOTS. I am trying to find them.

  8. Thank you very much brother for your contribution

  9. excellent work
    check this book too
    https://thequranicwords.wordpress.com/free-version/
