INSTRUCTIONS TO CANDIDATES
ANSWER ALL QUESTIONS
Tasks:
- Write a program that preprocesses the collection. This preprocessing stage should specifically include a function that tokenizes the text. In doing so, tokenize on whitespace and remove […]. For this task, please use your own implementation of a tokenizer.
- Determine the frequency of occurrence for all the words in the collection. Answer the following questions:
  a. What is the total number of words in the collection?
  b. What is the vocabulary size (i.e., the number of unique terms)?
  c. What are the top 20 words in the ranking (i.e., the words with the highest frequencies)?
  d. From these top 20 words, which ones are stop-words?
  e. What is the minimum number of unique words accounting for 15% of the total number of words in the collection?
Example: if the total number of words in the collection is 100, and we have the following word-frequency pairs:

| Word | tf |
|---|---|
| the | 20 |
| of | 10 |
| a | 10 |
| data | 8 |
| mining | 7 |
| … | … |

the answer to this question will be 1 (1 word accounts for 15% of the total 100 words).
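The tokenization, counting, and 15% coverage steps could be sketched in Java roughly as follows. This is only an illustration under stated assumptions, not a reference solution: the class and method names (`CollectionStats`, `tokenize`, `minWordsCovering`, and so on) are made up for this example, and because the task statement does not say what the tokenizer should remove, the sketch assumes that non-alphanumeric characters at the edges of each token are stripped and that tokens are lower-cased.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class CollectionStats {

    // Hand-written tokenizer: splits on whitespace only.
    // ASSUMPTION: edge punctuation is stripped and tokens are lower-cased;
    // the task text does not state what exactly has to be removed.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i <= text.length(); i++) {
            boolean atEnd = (i == text.length());
            if (atEnd || Character.isWhitespace(text.charAt(i))) {
                String token = trimNonAlphanumeric(current.toString());
                if (!token.isEmpty()) {
                    tokens.add(token.toLowerCase());
                }
                current.setLength(0);
            } else {
                current.append(text.charAt(i));
            }
        }
        return tokens;
    }

    // Drops leading and trailing characters that are neither letters nor digits.
    static String trimNonAlphanumeric(String s) {
        int start = 0;
        int end = s.length();
        while (start < end && !Character.isLetterOrDigit(s.charAt(start))) start++;
        while (end > start && !Character.isLetterOrDigit(s.charAt(end - 1))) end--;
        return s.substring(start, end);
    }

    // Term frequencies: the number of tokens is the total word count (a.),
    // the number of keys in the returned map is the vocabulary size (b.).
    static Map<String, Integer> countFrequencies(List<String> tokens) {
        Map<String, Integer> tf = new HashMap<>();
        for (String token : tokens) {
            tf.merge(token, 1, Integer::sum);
        }
        return tf;
    }

    // Entries sorted by descending frequency; ties are broken alphabetically.
    // The first 20 entries answer question c.
    static List<Map.Entry<String, Integer>> ranked(Map<String, Integer> tf) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(tf.entrySet());
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        return entries;
    }

    // Smallest number of top-ranked words whose counts add up to at least
    // `percent` percent of `totalTokens` (question e.). Integer arithmetic
    // avoids floating-point rounding at the threshold.
    static int minWordsCovering(List<Map.Entry<String, Integer>> ranked,
                                long totalTokens, int percent) {
        long covered = 0;
        int words = 0;
        for (Map.Entry<String, Integer> entry : ranked) {
            if (covered * 100 >= (long) percent * totalTokens) break;
            covered += entry.getValue();
            words++;
        }
        return words;
    }

    public static void main(String[] args) {
        // The word-frequency pairs from the example, with a stated total of 100.
        Map<String, Integer> tf = new HashMap<>();
        tf.put("the", 20);
        tf.put("of", 10);
        tf.put("a", 10);
        tf.put("data", 8);
        tf.put("mining", 7);
        System.out.println(minWordsCovering(ranked(tf), 100, 15)); // prints 1

        List<String> tokens = tokenize("The data, the mining: the DATA!");
        System.out.println(tokens.size());                   // prints 6
        System.out.println(countFrequencies(tokens).size()); // prints 3
    }
}
```

Taking words greedily from the top of the ranking gives the minimum, because any other choice of the same number of words covers no more tokens than the most frequent ones do.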
- Integrate the Porter stemmer and a stopword eliminator into your code, then answer questions a.-e. from the previous point again. (Links to a Java Porter stemmer implementation and to a stopword list are given below.)
https://www.dropbox.com/s/rexuzz3j56vi4bt/Porter.java
https://www.dropbox.com/s/5789sj8v07j2id0/stopwords.txt
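One possible way to wire stopword removal and stemming into the token stream is sketched below. Several things here are assumptions: the names `StopAndStem`, `parseStopwords`, and `filterAndStem` are hypothetical, the stopword file is assumed to hold one word per line, and since the API of the linked `Porter.java` is not known here, the stemmer is passed in as a generic `UnaryOperator<String>`. The toy stemmer in `main` only strips a trailing "s" and is not the Porter algorithm; it stands in for whatever call the real class exposes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.UnaryOperator;

final class StopAndStem {

    // Builds the stopword set from the lines of a stopword file.
    // ASSUMPTION: one stopword per line; blank lines are ignored.
    static Set<String> parseStopwords(List<String> lines) {
        Set<String> stopwords = new HashSet<>();
        for (String line : lines) {
            String word = line.trim().toLowerCase();
            if (!word.isEmpty()) {
                stopwords.add(word);
            }
        }
        return stopwords;
    }

    // Drops stopwords first, then stems what is left. Filtering before
    // stemming keeps the lookup on surface forms, which is what an
    // unstemmed stopword list contains.
    static List<String> filterAndStem(List<String> tokens,
                                      Set<String> stopwords,
                                      UnaryOperator<String> stemmer) {
        List<String> result = new ArrayList<>();
        for (String token : tokens) {
            if (stopwords.contains(token)) {
                continue;
            }
            result.add(stemmer.apply(token));
        }
        return result;
    }

    public static void main(String[] args) {
        Set<String> stopwords = parseStopwords(Arrays.asList("the", "of", "a"));

        // Placeholder stemmer for demonstration only; NOT the Porter stemmer.
        UnaryOperator<String> toyStemmer =
                w -> (w.endsWith("s") && w.length() > 3) ? w.substring(0, w.length() - 1) : w;

        List<String> tokens = Arrays.asList("the", "mining", "of", "data", "streams");
        System.out.println(filterAndStem(tokens, stopwords, toyStemmer));
        // prints [mining, data, stream]
    }
}
```

The filtered and stemmed token list can then be fed to the same frequency-counting code used for questions a.-e., so the two runs differ only in this preprocessing step.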