Dataset Instructions: The dataset is the collection of documents that will be used to train the
model. The dataset must be supplied in the ttl format. A ttl file is a text file in which each
line is one document. Every line of the file must use one of the following formats
(do not mix formats within a file):
<label><optional title>"text" @en
<label><optional title>"text"
<label><>"text"
<label>"text"
<label> <title> "text"
<label> "text"
The label, which is usually the author's name, is mandatory and goes between the first set of angle
brackets. If the author is a variant of ChatGPT, include the version number in the label, like
this:
<ChatGPT4>
<ChatGPT3.5>
The title, which is optional, goes between the second set of angle brackets.
If there is no title, the second set of brackets may be left empty or omitted. The associated text
goes between the double quotes. The text can be any length and contain any characters; however,
single and double quotation marks will be removed before processing begins. The text is later parsed
and added to the database, where image files are created based on a distribution graph of the
top 5, 15, and 50 parts of speech contained in the text.
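
To check a file before submitting it, the line formats described above can be matched with a single
regular expression. The following is a minimal sketch, not the parser used by the actual pipeline:
the names LINE_RE, Record, and parse_line are hypothetical, and treating the trailing @en as an
optional language tag is an assumption. The quote removal mirrors the rule stated above.

```python
import re
from typing import NamedTuple, Optional

# Hypothetical pattern covering the listed formats: a mandatory <label>,
# an optional <title> (possibly empty), the quoted text, and an optional
# trailing tag such as @en (assumed here to be a language tag).
LINE_RE = re.compile(
    r'^\s*<(?P<label>[^<>]+)>'            # mandatory label
    r'\s*(?:<(?P<title>[^<>]*)>)?'        # optional title, may be <>
    r'\s*"(?P<text>.*)"'                  # text between the outermost double quotes
    r'\s*(?:@(?P<lang>[A-Za-z-]+))?\s*$'  # optional tag, e.g. @en
)


class Record(NamedTuple):
    label: str
    title: Optional[str]
    text: str
    lang: Optional[str]


def parse_line(line: str) -> Record:
    """Split one dataset line into label, title, text, and optional tag."""
    match = LINE_RE.match(line)
    if match is None:
        raise ValueError(f"line does not match any supported format: {line!r}")
    # An absent title and an empty <> are both reported as None.
    title = (match.group("title") or "").strip() or None
    # Single and double quotation marks are removed from the text.
    text = match.group("text").replace('"', "").replace("'", "")
    return Record(match.group("label").strip(), title, text, match.group("lang"))
```

For example, parse_line('<ChatGPT4><Greeting>"Hello there" @en') yields the label ChatGPT4, the
title Greeting, the text Hello there, and the tag en. A line with no label raises ValueError.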