seanpedrickcase committed
Commit beed391 · Parent(s): 02721f3

Updated requirements

Files changed (4):
  1. app.py +1 -1
  2. requirements.txt +11 -11
  3. requirements_aws.txt +10 -10
  4. requirements_gpu.txt +12 -12
app.py CHANGED
@@ -63,7 +63,7 @@ with app:
     # Topic modeller
     Generate topics from open text in tabular data, based on [BERTopic](https://maartengr.github.io/BERTopic/). Upload a data file (csv, xlsx, or parquet), then specify the open text column that you want to use to generate topics. Click 'Extract topics' after you have selected the minimum similar documents per topic and maximum total topics. Duplicate this space, or clone to your computer to avoid queues here!

-    Uses TF-IDF-based embeddings by default, which are fast but do not lead to high-quality clustering. Change to the higher-quality [mxbai-embed-xsmall-v1](https://huggingface.co/mixedbread-ai/mxbai-embed-xsmall-v1) model embeddings (384 dimensions) for better results but slower processing time. If you have an embeddings .npz file previously made using this model, you can load it in at the same time to skip the first modelling step. If you have a pre-defined list of topics for zero-shot modelling, you can upload this as a csv file under 'I have my own list of topics...'. Further configuration options are available, such as maximum topics allowed, minimum documents per topic, etc. Topic representation with LLMs is currently based on [Llama-3.2-3B-Instruct-Q5_K_M.gguf](https://huggingface.co/bartowski/Llama-3.2-3B-Instruct-GGUF), which is quite slow on CPU, so use a GPU-enabled computer if possible, building from the requirements_gpu.txt file in the base folder.
+    Uses TF-IDF-based embeddings by default, which are fast but do not lead to high-quality clustering. Change to the higher-quality [mxbai-embed-xsmall-v1](https://huggingface.co/mixedbread-ai/mxbai-embed-xsmall-v1) model embeddings (384 dimensions) for better results but slower processing time. If you have an embeddings .npz file previously made using this model, you can load it in at the same time to skip the first modelling step. If you have a pre-defined list of topics for zero-shot modelling, you can upload this as a csv file under 'I have my own list of topics...'. Further configuration options are available, such as maximum topics allowed, minimum documents per topic, etc. Topic representation with LLMs is currently based on [gemma-2-it-GGUF](https://huggingface.co/unsloth/gemma-2-it-GGUF), which is quite slow on CPU, so use a GPU-enabled computer if possible, building from the requirements_gpu.txt file in the base folder.

     For small datasets, consider breaking up your text into sentences under 'Clean data' -> 'Split open text...' before topic modelling.
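The paragraph above names the knobs the app exposes: the embedding model, the minimum similar documents per topic, the maximum total topics, and an optional zero-shot topic list. Below is a minimal sketch of that workflow using the public BERTopic API; the file name, column name, example topic list, and parameter values are illustrative assumptions, not the app's actual code.

```python
import pandas as pd
from bertopic import BERTopic
from sentence_transformers import SentenceTransformer

df = pd.read_csv("responses.csv")            # hypothetical input file
docs = df["open_text"].astype(str).tolist()  # hypothetical open text column

# The higher-quality 384-dimension model mentioned above, instead of the TF-IDF default
embedding_model = SentenceTransformer("mixedbread-ai/mxbai-embed-xsmall-v1")

# Optional pre-defined topics for zero-shot modelling (in the app, read from the uploaded csv)
zeroshot_topics = ["delivery delays", "staff attitude", "pricing"]

topic_model = BERTopic(
    embedding_model=embedding_model,
    min_topic_size=5,             # 'minimum similar documents per topic'
    nr_topics=30,                 # reduce the fitted topics to at most this many
    zeroshot_topic_list=zeroshot_topics,
    zeroshot_min_similarity=0.5,  # how close a document must be to match a named topic
)
topics, probs = topic_model.fit_transform(docs)
print(topic_model.get_topic_info().head())
```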
requirements.txt CHANGED
@@ -3,21 +3,21 @@ plotly==6.3.1
 scikit-learn==1.7.2
 umap-learn==0.5.9.post2
 gradio==5.49.1
-boto3==1.40.55
+boto3==1.40.72
 transformers==4.57.1
 accelerate==1.11.0
 bertopic==0.17.3
-spacy==3.8.7
+spacy==3.8.8
 en_core_web_sm @ https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0.tar.gz
-pyarrow==21.0.0
-openpyxl==3.1.5
-Faker==37.11.0
-presidio_analyzer==2.2.360
-presidio_anonymizer==2.2.360
-scipy==1.15.3
-polars==1.34.0
-sentence-transformers==5.1.1
-torch==2.6.0 --extra-index-url https://download.pytorch.org/whl/cu124
+pyarrow>=21.0.0
+openpyxl>=3.1.5
+Faker>=37.11.0
+presidio_analyzer>=2.2.360
+presidio_anonymizer>=2.2.360
+scipy>=1.15.3
+polars>=1.34.0
+sentence-transformers==5.2.0
+torch>=2.6.0 --extra-index-url https://download.pytorch.org/whl/cu124
 #https://github.com/abetlen/llama-cpp-python/releases/download/v0.3.2/llama_cpp_python-0.3.2-cp311-cp311-win_amd64.whl # Exact wheel specified for windows
 #llama-cpp-python==0.3.2 --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cpu
 # Specify exact llama_cpp wheel for huggingface compatibility
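The sentence-transformers and torch pins above are the pieces behind the reusable embeddings .npz file that the app description mentions. A hedged sketch of that save-and-reload round trip, assuming a plain numpy archive with an "embeddings" key (the key and file names are assumptions, not the app's conventions):

```python
import numpy as np
from sentence_transformers import SentenceTransformer

docs = ["first response", "second response"]  # illustrative documents
model = SentenceTransformer("mixedbread-ai/mxbai-embed-xsmall-v1")

# First run: compute once and save (shape: n_docs x 384)
embeddings = model.encode(docs)
np.savez_compressed("embeddings.npz", embeddings=embeddings)

# Later runs: reload and hand to BERTopic to skip the embedding step, e.g.
# topic_model.fit_transform(docs, embeddings=np.load("embeddings.npz")["embeddings"])
```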
requirements_aws.txt CHANGED
@@ -2,20 +2,20 @@ pandas==2.3.3
 plotly==6.3.1
 scikit-learn==1.7.2
 umap-learn==0.5.9.post2
-boto3==1.40.55
-spacy==3.8.7
+boto3==1.40.72
+spacy==3.8.8
 en_core_web_sm @ https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0.tar.gz
 gradio==5.49.1
-pyarrow
-openpyxl
-Faker
-presidio_analyzer==2.2.35
-presidio_anonymizer==2.2.35
-scipy
-polars
+pyarrow>=21.0.0
+openpyxl>=3.1.5
+Faker>=37.11.0
+presidio_analyzer>=2.2.360
+presidio_anonymizer>=2.2.360
+scipy>=1.15.3
+polars>=1.34.0
 transformers==4.57.1
 accelerate==1.11.0
 bertopic==0.17.3
-sentence-transformers==5.1.1
+sentence-transformers==5.2.0
 spaces==0.42.1
 numpy==2.2.6
requirements_gpu.txt CHANGED
@@ -3,22 +3,22 @@ plotly==6.3.1
 scikit-learn==1.7.2
 umap-learn==0.5.9.post2
 gradio==5.49.1
-boto3==1.40.55
+boto3==1.40.72
 transformers==4.57.1
 accelerate==1.11.0
-torch==2.6.0 --extra-index-url https://download.pytorch.org/whl/cu124
+torch>=2.6.0 --extra-index-url https://download.pytorch.org/whl/cu124
 bertopic==0.17.3
-spacy==3.8.7
+spacy==3.8.8
 en_core_web_sm @ https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0.tar.gz
-pyarrow
-openpyxl
-Faker
-presidio_analyzer==2.2.355
-presidio_anonymizer==2.2.355
-scipy
-polars
-llama-cpp-python==0.3.4 --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cu121
-sentence-transformers==5.1.1
+pyarrow>=21.0.0
+openpyxl>=3.1.5
+Faker>=37.11.0
+presidio_analyzer>=2.2.360
+presidio_anonymizer>=2.2.360
+scipy>=1.15.3
+polars>=1.34.0
+llama-cpp-python==0.3.4 --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cu124
+sentence-transformers==5.2.0
 spaces==0.42.1
 numpy==2.2.6
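requirements_gpu.txt moves the llama-cpp-python wheel from the cu121 to the cu124 index, matching the torch build, so the GGUF model used for topic representation runs on the GPU. Below is a hedged sketch of wiring such a model into BERTopic via its built-in LlamaCPP representation (available from BERTopic 0.16 onwards); the local model path is a placeholder for a quantised file downloaded from https://huggingface.co/unsloth/gemma-2-it-GGUF.

```python
from llama_cpp import Llama
from bertopic import BERTopic
from bertopic.representation import LlamaCPP

# Placeholder path to a locally downloaded quantised GGUF file
llm = Llama(
    model_path="models/gemma-2-it-Q5_K_M.gguf",
    n_gpu_layers=-1,  # offload all layers to the GPU; needs the cu124 wheel above
    n_ctx=2048,
)

# The LLM only labels the topics; clustering itself stays embedding-based
topic_model = BERTopic(representation_model=LlamaCPP(llm))
```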