Text Generation and TF-IDF Analysis

This project combines GPT-2 text generation with a Natural Language Processing (NLP) preprocessing pipeline and TF-IDF (Term Frequency-Inverse Document Frequency) analysis of the generated text.

Features

Text Generation: Uses the GPT-2 model to generate text from a given prompt
Text Preprocessing Pipeline:
- Text Cleaning (removes special characters)
- Text Normalization (converts to lowercase)
- Tokenization (splits text into words)
- Lemmatization (reduces words to their base form)
- Stopword Removal
- Unique Word Extraction
TF-IDF Analysis: Calculates TF-IDF scores for the terms in the generated text

Requirements

Python 3.x
Required Python packages:
- scikit-learn
- nltk
- transformers
- numpy
- pandas

Installation

Clone the repository:

git clone https://github.com/Yossefmohammed/Generate-docs-and-calc-TF-IDF.git
cd Generate-docs-and-calc-TF-IDF

Install the required packages:

pip install -r requirements.txt

Download the required NLTK data from a Python session (recent NLTK releases may also need 'punkt_tab' for tokenization):

import nltk
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

Usage

The project is implemented as a single Jupyter notebook (Generate docs and calc TF-IDF.ipynb). Open it in Jupyter Notebook or JupyterLab and run the cells in order.

The notebook demonstrates:

Text generation from prompts
Text preprocessing pipeline
TF-IDF analysis of the generated text

Project Structure

Generate docs and calc TF-IDF.ipynb: Main notebook containing the implementation
README.md: Project documentation
requirements.txt: List of required Python packages

Features in Detail

Text Generation

Uses the GPT-2 model from Hugging Face's transformers library
Generates text based on user-provided prompts
Configurable maximum length for generated text
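
A minimal sketch of this step, assuming the notebook uses the transformers text-generation pipeline with the "gpt2" checkpoint. The function name generate_text and the generator parameter are hypothetical, added here so the model can be swapped or stubbed:

```python
def generate_text(prompt, max_length=50, generator=None):
    """Return the prompt plus the model's continuation as one string."""
    if generator is None:
        # Imported lazily because loading transformers and GPT-2 is slow;
        # the first call also downloads the model weights.
        from transformers import pipeline
        generator = pipeline("text-generation", model="gpt2")
    # max_length counts tokens (prompt included), not characters.
    outputs = generator(prompt, max_length=max_length, num_return_sequences=1)
    return outputs[0]["generated_text"]

# Example call (downloads GPT-2 on first use):
# print(generate_text("Natural language processing is", max_length=40))
```

Because GPT-2 sampling is not deterministic by default, the output will generally differ from run to run.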

Text Preprocessing

Cleaning: Removes special characters and non-alphabetic content
Normalization: Converts text to lowercase
Tokenization: Splits text into individual words
Lemmatization: Reduces words to their base form using WordNet
Stopword Removal: Removes common English stopwords
Unique Word Extraction: Identifies unique words in the processed text
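
The steps above can be sketched as one function. This is an illustration under assumptions, not the notebook's code: the name preprocess and its parameters are hypothetical, and tokenization is simplified to a whitespace split, which is adequate once non-alphabetic characters have been removed. When no stopword set or lemmatizer is passed in, it falls back to NLTK's English stopwords and WordNet lemmatizer, which need the NLTK data from the Installation section.

```python
import re

def preprocess(text, stopwords=None, lemmatize=None):
    """Return (filtered tokens, sorted unique words) for a text."""
    if stopwords is None or lemmatize is None:
        # Lazy import: requires nltk plus its 'stopwords' and 'wordnet' data.
        from nltk.corpus import stopwords as nltk_stopwords
        from nltk.stem import WordNetLemmatizer
        if stopwords is None:
            stopwords = set(nltk_stopwords.words("english"))
        if lemmatize is None:
            lemmatize = WordNetLemmatizer().lemmatize
    cleaned = re.sub(r"[^A-Za-z\s]", " ", text)       # cleaning
    tokens = cleaned.lower().split()                   # normalization + tokenization
    lemmas = [lemmatize(tok) for tok in tokens]        # lemmatization
    kept = [w for w in lemmas if w not in stopwords]   # stopword removal
    unique = sorted(set(kept))                         # unique word extraction
    return kept, unique
```

Returning both the filtered token list and the unique words keeps term counts available for the TF-IDF step, which the unique set alone would lose.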

TF-IDF Analysis

Calculates TF-IDF weights, which combine term frequency with inverse document frequency
Identifies important terms in the generated text
Provides insights into the most significant words in the corpus
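
To show what the scores mean, here is a small standard-library sketch of the calculation. The notebook presumably relies on scikit-learn for this; the function tfidf below is a hypothetical stand-in that uses the smoothed IDF, ln((1 + N) / (1 + df)) + 1, followed by L2 normalization, which matches the defaults documented for scikit-learn's TfidfVectorizer.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    # Smoothed inverse document frequency.
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1 for t, c in df.items()}
    weights = []
    for doc in docs:
        tf = Counter(doc)
        raw = {t: count * idf[t] for t, count in tf.items()}
        # L2-normalize so documents of different lengths are comparable.
        norm = math.sqrt(sum(v * v for v in raw.values())) or 1.0
        weights.append({t: v / norm for t, v in raw.items()})
    return weights

docs = [["cat", "sat", "mat"], ["cat", "ran"]]
for row in tfidf(docs):
    # Highest-weighted terms first.
    print(sorted(row.items(), key=lambda kv: -kv[1]))
```

In this toy corpus "cat" appears in both documents, so it receives a lower weight than terms that appear in only one.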

Contributing

Feel free to submit issues and enhancement requests!

License

This project is open source and available under the MIT License.

Author

Youssef Mohammed

Contact Information

Email: [email protected]
Phone: 01126078938
