
I am working on an app to help people learn English as a second language. I have validated that example sentences help in learning a language by providing extra context. I did that by conducting a small study in a classroom of 60 students.

I have mined over a hundred thousand sentences from Wikipedia for various English words (including Barron's 800 words and the 1000 most common English words).

The entire dataset is available at https://buildmyvocab.in

To maintain the quality of the content, I filtered out sentences longer than 160 characters, since they might be difficult to understand.
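For illustration, the length filter amounts to something like the sketch below. The names `MAX_CHARS` and `keep_short` and the sample data are made up for this example and are not taken from the actual pipeline.

```python
# Hypothetical sketch of the length filter described above.
MAX_CHARS = 160  # cutoff mentioned in the question


def keep_short(sentences, max_chars=MAX_CHARS):
    """Keep only sentences that are at most max_chars characters long."""
    return [s for s in sentences if len(s) <= max_chars]


# Made-up sample: one short sentence and one 200-character string.
sample = ["A short sentence.", "x" * 200]
filtered = keep_short(sample)
```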

As a next step, I want to automate sorting this content by ease of understanding. I am a non-native English speaker myself. I want to know what features I can use to separate easy sentences from difficult ones.

Also, do you think this is possible?

BuildMyVocab

1 Answer


Yes. There are various readability metrics, such as the Gunning fog index. The textacy library in Python has a nice list of them, with implementations.

>>> ts.flesch_kincaid_grade_level
10.853709110179697
>>> ts.readability_stats
{'automated_readability_index': 12.801546064781363,
 'coleman_liau_index': 9.905629258346586,
 'flesch_kincaid_grade_level': 10.853709110179697,
 'flesch_readability_ease': 62.51222198133965,
 'gulpease_index': 55.10492845786963,
 'gunning_fog_index': 13.69506833036245,
 'lix': 45.76390294037353,
 'smog_index': 11.683781121521076,
 'wiener_sachtextformel': 5.401029023140788}
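
If you would rather not depend on a library, one of these scores can be approximated by hand and used as a sort key. The sketch below computes the Flesch-Kincaid grade level with a crude vowel-group syllable counter; it is not textacy's implementation, the helper names and the two sample sentences are made up for illustration, and the syllable heuristic will be wrong for some words.

```python
import re


def count_syllables(word):
    """Rough heuristic (an assumption, not a dictionary lookup):
    count groups of consecutive vowels."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    # A trailing silent 'e' usually does not add a syllable.
    if word.endswith("e") and not word.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)


def flesch_kincaid_grade(text):
    """Flesch-Kincaid grade level: higher means harder to read."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        return 0.0
    syllables = sum(count_syllables(w) for w in words)
    return (0.39 * len(words) / len(sentences)
            + 11.8 * syllables / len(words)
            - 15.59)


# Made-up sample sentences, sorted from easiest to hardest.
sentences = [
    "Notwithstanding considerable opposition, the legislature ratified the controversial amendment.",
    "The cat sat on the mat.",
]
ranked = sorted(sentences, key=flesch_kincaid_grade)
```

These formulas were designed for longer passages, so on single sentences the score is driven almost entirely by sentence length and word length. It may be worth combining it with other features, such as word frequency.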
Emre
GrimSqueaker