Authorship Forensics Portal

Dataset Name: The dataset name is the name under which the dataset will be created in the database and shown in all other areas of the portal. The name should be unique and descriptive, and it should be between 3 and 50 characters long. Avoid special characters and spaces: any that are entered will be removed from the name before the dataset is created. If the name is not unique, or it contains a reserved keyword, the dataset will not be created and an error message will be displayed.
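The naming rules above can be sketched in Python. This is a minimal illustration, not the portal's actual code: the function name, the reserved-keyword list, the reading of "special characters" as anything other than letters, digits and underscores, and the exact-match keyword check are all assumptions.

```python
import re

# Hypothetical placeholder list: the portal's real reserved keywords are not documented here.
RESERVED_KEYWORDS = {"select", "drop", "table"}


def clean_dataset_name(raw_name, existing_names=()):
    """Return the cleaned dataset name, or raise ValueError if it is rejected."""
    # Assumption: "special characters" means anything other than
    # letters, digits and underscores. Spaces are removed as well.
    cleaned = re.sub(r"[^A-Za-z0-9_]", "", raw_name)
    if not 3 <= len(cleaned) <= 50:
        raise ValueError("name must be between 3 and 50 characters long")
    # Assumption: a name is rejected only when it equals a reserved keyword.
    if cleaned.lower() in RESERVED_KEYWORDS:
        raise ValueError("name is a reserved keyword")
    if cleaned in existing_names:
        raise ValueError("name is already in use")
    return cleaned
```

For example, under these assumptions `clean_dataset_name("My Data-Set!")` returns `MyDataSet`, while a name that shrinks below three characters after cleaning is rejected.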

Step 1: Enter a Name for the Dataset:

AI Generated Documents: AI generated documents can be added to the dataset to increase the number of documents available for training. This is useful when you want to train against AI generated authors but do not have any in your dataset. The AI documents are generated from the content of the documents in the uploaded dataset: ChatGPT 3.5 generates a title for each selected document, and the selected model then generates a new document from that title. When the 'Add AI Documents' button is clicked, a new row is added to the table below with the selected AI model, API key, number of documents to generate, and whether the documents are randomly selected. The documents are generated and added to the dataset when the 'Upload' button is clicked. Note: You must be logged in to use this feature.

AI Model: Select the AI model to use for generating the documents. ChatGPT 3.5 is always used to generate the title, but you can select either ChatGPT 3.5 or ChatGPT 4.0 to generate the document.
API Key: Select the API key to use for generating the documents. The API key is used to access the OpenAI API. When you click the 'Upload' button, the API key is tested to ensure it is valid.
Max Docs: When this is checked, each document in the uploaded dataset will be used to generate a document. If this is not checked, only the number of documents specified in the next field will be generated. Note: The number of documents in the dataset is the automatic maximum.
Randomly Selected: When this is checked, the documents will be randomly selected from the dataset. Otherwise, the documents will be generated in the order they appear in the dataset.
Number of AI Documents to Generate: If the 'Max Docs' checkbox is checked, this is the number of documents to generate for each document in the dataset. If the 'Max Docs' checkbox is not checked, this is the total number of documents to generate.
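One way to read how the 'Max Docs', 'Randomly Selected' and 'Number of AI Documents to Generate' options combine is sketched below. The function and parameter names are hypothetical, and the portal's actual selection logic may differ. The sketch only decides which uploaded documents act as sources and does not call any AI model.

```python
import random


def plan_generation(documents, count, max_docs=False, randomly_selected=False, seed=None):
    """Return one source document per AI document to be generated."""
    documents = list(documents)
    if max_docs:
        # Max Docs checked: every uploaded document is a source,
        # and 'count' AI documents are generated for each of them.
        return [doc for doc in documents for _ in range(count)]
    # Max Docs unchecked: 'count' is the total, capped at the dataset size.
    total = min(count, len(documents))
    if randomly_selected:
        # 'seed' is only here to make the sketch reproducible.
        return random.Random(seed).sample(documents, total)
    # Otherwise sources are taken in the order they appear in the dataset.
    return documents[:total]
```

With three uploaded documents and a count of 2, this reading gives two sources when 'Max Docs' is unchecked and six generation tasks (two per document) when it is checked.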

Step 2: Add AI Documents (Optional):

AI Model	API Key Label	Number of Documents	Randomly Selected	Actions
No AI Generated Documents

POS Model Types: Select the POS (part-of-speech) model types to create. The top 5, 15, and 50 parts of speech are used to create the relative frequency graphs: each document is parsed, and one graph is created for each top type selected. If all three are selected, three graphs are created for each document, one each for the top 5, 15, and 50 parts of speech. Training is performed for each group of POS top types in each class. At least one POS model type must be selected to continue.
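The data behind one such graph can be illustrated with a short Python sketch. It assumes the document has already been tagged, so the input is a plain list of part-of-speech tags, and it assumes frequencies are taken relative to all tags in the document. The function name is hypothetical, and the portal's tagger and graph rendering are not shown.

```python
from collections import Counter


def top_pos_relative_frequencies(pos_tags, top_n):
    """Map each of the top_n most common tags to its share of all tags."""
    counts = Counter(pos_tags)
    total = sum(counts.values())
    if total == 0:
        return {}
    # Assumption: the frequency is relative to every tag in the document,
    # not only to the tags that make it into the top_n.
    return {tag: n / total for tag, n in counts.most_common(top_n)}


# One set of graph data per selected POS model type.
tags = ["NN", "VB", "NN", "DT", "NN", "VB"]
graph_data = {n: top_pos_relative_frequencies(tags, n) for n in (5, 15, 50)}
```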

Step 3: Select the POS Model Types to Create:

POS model type(s):

Top 5

Top 15

Top 50

Model Attributes: Select each training option carefully, then click the 'Add Model' button to add the selected model to the list of models to train:

Classes:

Three: The model will be trained to classify the text into three classes. Authors labeled 'ChatGPT3.5' and 'ChatGPT4' each form their own class, and any other author is classified as 'Human'.
Two: The model will be trained to classify the text into two classes. Authors with names beginning with 'ChatGPT' are classified as 'ChatGPT' and all other authors as 'Human'.
All: The model will be trained to classify the text into all classes. Each author is treated as a separate class.
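The three class options amount to a mapping from author label to class label, sketched below. The function name and the lowercase option strings are hypothetical.

```python
def class_label(author, classes):
    """Map an author label to its training class for the chosen option."""
    if classes == "all":
        # Every author is a separate class.
        return author
    if classes == "two":
        # Any label beginning with 'ChatGPT' collapses into one class.
        return "ChatGPT" if author.startswith("ChatGPT") else "Human"
    if classes == "three":
        # Only these two exact labels keep their own class.
        return author if author in ("ChatGPT3.5", "ChatGPT4") else "Human"
    raise ValueError(f"unknown classes option: {classes!r}")
```

Under this reading, an author labeled 'ChatGPT4' maps to 'ChatGPT4' with Three, to 'ChatGPT' with Two, and to itself with All, while a human author such as 'Jane Austen' maps to 'Human' with Three and Two.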

Min Docs: This will find authors with at least the specified number of documents then only train the model on those authors. If set to 0, all authors will be included in the training set.
Max Docs: This will limit the number of documents per author to the specified number. If set to 0, all documents will be included in the training set.
Train %: This will determine the percentage of each included author's documents that will be used for training. The remaining documents will be used for testing.
Auto %: Can be either a percentage or set to ‘false’. When set to ‘false’, the model is built using the typical pre-set training and test data. When a percentage is used, however, the test data is kept out of the training process and that percentage of the current training data is used as test data instead. This ensures the model trains without ever seeing the set-aside test data. The final model performance is always calculated using the set-aside test data.
Threshold %: This is the minimum accuracy percentage that the model must achieve before it is considered successful. If the model does not achieve this percentage, training continues until it does.
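The following Python sketch shows one possible way the Min Docs, Max Docs, Train % and Auto % options could partition a dataset. It is an illustration under assumptions, not the portal's implementation: the function name is hypothetical, documents are taken in their stored order rather than shuffled, and counts are rounded down.

```python
def build_splits(docs_by_author, min_docs=0, max_docs=0, train_pct=80, auto_pct=False):
    """Split documents into training, validation and set-aside test lists."""
    train, validation, test = [], [], []
    for author, docs in docs_by_author.items():
        # Min Docs: skip authors with too few documents (0 keeps every author).
        if len(docs) < min_docs:
            continue
        # Max Docs: cap the documents per author (0 keeps every document).
        # Assumption: the first documents are kept; the portal may choose differently.
        if max_docs:
            docs = docs[:max_docs]
        # Train %: this share of the author's documents is used for training,
        # and the rest is set aside for testing.
        n_train = len(docs) * train_pct // 100
        author_train, author_test = docs[:n_train], docs[n_train:]
        author_val = []
        if auto_pct is not False:
            # Auto %: carve this share out of the training data to act as
            # test data during training, so the set-aside test data stays unseen.
            n_val = len(author_train) * auto_pct // 100
            split_at = len(author_train) - n_val
            author_train, author_val = author_train[:split_at], author_train[split_at:]
        train += [(author, d) for d in author_train]
        validation += [(author, d) for d in author_val]
        test += [(author, d) for d in author_test]
    return train, validation, test
```

For an author with 10 documents, Train % of 80 and Auto % of 25, this sketch yields 6 training documents, 2 documents used as test data during training, and 2 set-aside test documents for the final performance figure.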

Step 4: Select the Models to Train:

Classes:

Min Docs:

Max Docs:

Train %:

Auto %:

Threshold %:

Classes	Min Docs	Max Docs	Train Percent	Auto	Threshold	Actions
Add a model to continue

Dataset Instructions: The dataset is the collection of documents that will be used to train the model. The dataset should be uploaded as a single file in the ttl format, which is a text file where each line is one document. Each line of the text file should use one of the following formats (do not mix formats inside a file):

<label><optional title>"text" @en
<label><optional title>"text"
<label><>"text"
<label>"text"
<label> <title> "text"
<label> "text"

The label, which is likely the author's name, is mandatory and should be placed between the first set of angled brackets. If the author is a variant of ChatGPT, include the version number in the label like this:

<ChatGPT4>
<ChatGPT3.5>

The title, which is not mandatory, can be placed between the second set of angled brackets. If no title is required, the second set of brackets can be present or omitted. The associated text should be placed between the double quotes. The text can be any length and contain any characters; however, single and double quotation marks will be removed before processing begins. The text is later parsed and added to the database, where image files are created based on a distribution graph of the top 5, 15, and 50 parts of speech contained in the text.
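As an illustration of the line formats described above, a line could be parsed in Python roughly as follows. The function name and regular expression are assumptions for this sketch, not the portal's parser. The sketch accepts all six formats, treats an empty or missing title as no title, and removes only straight single and double quotation marks from the text.

```python
import re

# <label>, an optional <title>, the quoted text, and an optional language tag such as @en.
TTL_LINE = re.compile(r'^\s*<([^<>]+)>\s*(?:<([^<>]*)>)?\s*"(.*)"\s*(?:@[A-Za-z-]+)?\s*$')


def parse_ttl_line(line):
    """Return (label, title, text) for one dataset line, or raise ValueError."""
    match = TTL_LINE.match(line)
    if match is None:
        raise ValueError("line does not match any supported ttl format")
    label, title, text = match.groups()
    # Single and double quotation marks are removed before processing.
    text = text.replace('"', "").replace("'", "")
    # An empty <> or a missing second bracket pair both mean "no title".
    return label.strip(), (title.strip() or None) if title else None, text
```

For example, `parse_ttl_line('<ChatGPT4><>"hi"')` returns `('ChatGPT4', None, 'hi')`.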

Step 5: Upload the ttl Formatted Dataset File:

Dataset Processing Status:

No Datasets