05.00 Lucene Notes
A brief history
1997 - Doug Cutting starts writing Lucene
2000 - Released as open source on SourceForge
2004 - Widely accepted
What is Simpy?
Simpy is a social bookmarking service created by Otis Gospodnetić, co-author of Lucene in Action.
Go to the site
General idea of indexing
The following objects are used to accomplish indexing:
IndexWriter
Analyzer
Document
Field
You prime an IndexWriter with a directory path and an Analyzer. The IndexWriter writes the index files into that directory and uses the analyzer to process each document, picking out the significant terms to index.
A Document (unlike an HTML or Word document) is a collection of fields, where each field can contain any amount of text. You hand the document to the IndexWriter to index it, and you can add any number of documents to the same writer.
Depending on its type (Keyword, UnIndexed, UnStored, Text), a field may or may not be indexed, and its original text may or may not be stored with the document.
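To make the idea concrete, here is a minimal sketch of an inverted index in plain Java. It is a toy and not how Lucene is implemented: the class ToyIndex, its methods, and the crude analyzer (lower-case, split on non-alphanumerics) are all assumptions made up for illustration. It only shows the shape of the flow: a document is a map of fields, it is dropped into the index, and the analyzer decides which terms get indexed.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Toy inverted index; NOT Lucene's implementation. All names are hypothetical. */
class ToyIndex {
    // One document = field name -> text ("a document is a collection of fields").
    private final List<Map<String, String>> documents = new ArrayList<>();
    // Postings: "field:term" -> numbers of the documents containing that term in that field.
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    /** Stand-in for an analyzer: lower-case the text and split on non-alphanumerics. */
    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    /** Adds a document to the index and returns the number assigned to it. */
    int addDocument(Map<String, String> fields) {
        int docNumber = documents.size();
        documents.add(fields);
        for (Map.Entry<String, String> field : fields.entrySet()) {
            for (String term : analyze(field.getValue())) {
                String key = field.getKey() + ":" + term;
                postings.computeIfAbsent(key, k -> new TreeSet<>()).add(docNumber);
            }
        }
        return docNumber;
    }

    /** Returns the numbers of the documents whose given field contains the term. */
    Set<Integer> lookup(String field, String term) {
        String key = field + ":" + term.toLowerCase(Locale.ROOT);
        return postings.getOrDefault(key, new TreeSet<>());
    }

    public static void main(String[] args) {
        ToyIndex index = new ToyIndex();
        Map<String, String> doc = new HashMap<>();
        doc.put("title", "Lucene Notes");
        doc.put("content", "Index writer, analyzer, document and field");
        index.addDocument(doc);
        System.out.println(index.lookup("content", "analyzer")); // prints [0]
    }
}
```

Every field is both indexed and kept in this toy. The Lucene field types exist precisely so that you can choose those two things separately for each field.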
General idea of searching
The counterpart of the IndexWriter is the IndexSearcher, which takes the same directory path as its input. Once opened, it is ready to run a query. You obtain the Query by passing your input string to a QueryParser, which itself takes an Analyzer as input.
In summary, the relevant objects are:
IndexSearcher
Query
QueryParser
Analyzer
Hits
Hits is a collection of lazily loaded documents returned by the search.
What is the role of fields in searching
When searching, the QueryParser takes a field name as one of its arguments and uses it as the default field. What is the general role of fields in searching? What happens if you don't specify a field name in a search query?
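In Lucene's query syntax a clause can name its field as field:term, and a clause without a field prefix is searched in the default field. The sketch below imitates only that one rule in plain Java. It is not Lucene's QueryParser, and the class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy illustration of a default field; this is not Lucene's QueryParser. */
class DefaultFieldSketch {
    /** Splits a query on whitespace and resolves every clause to "field:term". */
    static List<String> resolve(String query, String defaultField) {
        List<String> clauses = new ArrayList<>();
        for (String clause : query.trim().split("\\s+")) {
            if (clause.isEmpty()) {
                continue; // empty query string
            }
            int colon = clause.indexOf(':');
            if (colon > 0 && colon < clause.length() - 1) {
                clauses.add(clause); // the clause already names its field
            } else {
                clauses.add(defaultField + ":" + clause); // fall back to the default field
            }
        }
        return clauses;
    }

    public static void main(String[] args) {
        System.out.println(resolve("author:henry relativity title:superstring", "content"));
        // prints [author:henry, content:relativity, title:superstring]
    }
}
```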
You can do the following with Lucene
A quote:
"Google could index the articles but we wouldn't be able to show results based on questions such as, 'show me all the articles by Professor Henry that deal with relativity and have superstring in their title.'"
- Thomas Paul at JavaRanch
Parallels to database indexing
A Lucene index is a data store that resembles a database table, and you search the index much as you would query a table. Documents are added to the index the way rows are inserted into a table. Unlike rows, documents need not all have the same fields (columns). As in a database, each document (row) has some fields that are indexed and some that are not.
How is the web crawled
A nice article by Keld Hansen
He says
"The first step is to find out how to "crawl the web". That is: request a page using the HTTP protocol, receive the page, extract the text in the page, and harvest the links in the page. Then repeat this process for every link found."
The interesting conclusion is that if no page on an already public site links to your page, the crawlers never find it.
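The loop Keld Hansen describes can be sketched in a few lines of Java. To keep it runnable without a network, the "web" here is an in-memory map from URL to HTML, which is an assumption standing in for real HTTP requests. The regular expression is a crude way to harvest links and not a real HTML parser, and all names are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Toy crawler over an in-memory "web"; all names are hypothetical. */
class CrawlSketch {
    // Crude link harvester: matches href="..." only.
    private static final Pattern HREF =
            Pattern.compile("href\\s*=\\s*\"([^\"]+)\"", Pattern.CASE_INSENSITIVE);

    /** Returns the pages reachable from the seed, in the order they were visited. */
    static Set<String> crawl(String seed, Map<String, String> web) {
        Set<String> visited = new LinkedHashSet<>();
        Deque<String> queue = new ArrayDeque<>();
        queue.add(seed);
        while (!queue.isEmpty()) {
            String url = queue.remove();
            String html = web.get(url); // stands in for "request a page, receive the page"
            if (html == null || !visited.add(url)) {
                continue; // missing page, or already crawled
            }
            Matcher links = HREF.matcher(html);
            while (links.find()) {
                queue.add(links.group(1)); // "harvest the links", then repeat for each
            }
        }
        return visited;
    }

    public static void main(String[] args) {
        Map<String, String> web = new HashMap<>();
        web.put("/home", "<a href=\"/a\">A</a> <a href=\"/b\">B</a>");
        web.put("/a", "<a href=\"/home\">back</a>");
        web.put("/b", "no links here");
        web.put("/hidden", "nobody links to this page");
        System.out.println(crawl("/home", web)); // prints [/home, /a, /b]
    }
}
```

The page /hidden exists in the map but never shows up in the result, which is the conclusion above in miniature.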
Look for some articles on "relevance"
So far, the search has been for a handful of keywords known to the user. Look for strategies that, given a whole document's worth of information, find similar documents already in the database.
Players such as Google are probably doing this already. I wonder whether their desktop toolkit has it built in.
What about Lucene? Look for literature or check the newsgroup on this subject. See what Pramod came up with from the book.
See whether any of the researchers at Stanford have information on this.
Glossary of IR terms
Glossary
This is very useful because it covers most of the ideas in IR and what they are called in the literature. Knowing the names lets us search for those ideas on Google.
For example see the following extract
Content-Based Filtering: The process of filtering by extracting features from the text of documents to determine the documents' relevance. Also called "cognitive filtering".
Content based filtering - Oard and Marchionini
Read the article
See if the article presents strategies
Go through the Lucene FAQ page on jGuru
jGuru FAQ
See if I can find out about content-based filtering here
Example of a term frequency vector
{content: 0/1, 02/1, 03/1, 04/1, 05/1, 1/4, 10/4, 12/1, 14/1, 2/1, 2.0/1,
2005/8, 22/1, 24/5, 26/1, 27/5, 28/1, 33/2, 34/2, 36/1, 5/1, access/1,
accessed/1, agent/5, agentdao/1, akc/1, already/1, am/4, append/1,
architectural/1, architecture/3, author/3, b/1, back/1,
between/1, blogs/1, blue/2, ccp/9, central/1, channel/1, class/1,
classic/1, clearcase/1, column/1, content/1, create/1, cross/1,
current/2, cvs/1, cvsroot/1, data/4, default/1, delivery/1,
develop/1, directory/1, display/1, doc/2, docs/3, embedded/1,
enquiry/1, essentially/1, excel/1, feedback/1, fileupload/1,
florida/1, folder/1, format/1, framework/1, friday/4, from/1,
functionality/2, general/1, generic/2, get/1, go/1, google/1,
have/1, home/2, host/1, how/3, i/1, idea/1, information/1,
initial/1, interface/5, june/8, knowledge/2, library/2, links/1, look/1,
main/1, manage/3, managers/1, manipulate/1, mapping/1, masterpage/1, model/1,
monday/4, mq/1, much/1, my/2, needs/1, new/4, next/1, object/1, obtained/1,
other/1, page/1, paging/3, parent/1, password/2, path/1, piece/1, plans/1,
pm/4, pmfweb/1, port/1, portal/5, print/1, products/1, project/1,
prototype/1, pserver/1, public/1, purpose/1, put/1, r2/1, r3/1, r3/saa/1,
r4/2, rating/2, read/4, records/2, release/3, releases/1, repository/1,
request/1, requirements/2, requires/1, response/1, returning/1, review/1,
sales/1, satya/8, schedules/1, search/1, see/4, senior/1, service/1, set/1,
shield/1, siebel/5, site/1, sorting/1, specs/1, staff/1, standard/1,
strategic/1, sufficient/2, summary/1, support/1, sync/1, test/1, text/1,
through/3, together/1, ui/1, urls/3, validate/1, via/1,
vision/1, web/2, welcome/1, what/4, windows/1, work/5, xml/8}
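A vector like the one above is just a sorted map from term to count. Here is a minimal plain-Java sketch that builds one and prints it in the same term/count layout. The tokenizer is a crude assumption (lower-case, split on non-alphanumerics, no stop word removal), so it will not reproduce exactly what Lucene's analyzer emits, and the names are hypothetical.

```java
import java.util.Locale;
import java.util.Map;
import java.util.StringJoiner;
import java.util.TreeMap;

/** Toy term frequency counter; the tokenizer is a crude stand-in for an analyzer. */
class TermFrequencySketch {
    /** Counts how often each term occurs; TreeMap keeps the terms sorted. */
    static Map<String, Integer> termFrequencies(String text) {
        Map<String, Integer> frequencies = new TreeMap<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!token.isEmpty()) {
                frequencies.merge(token, 1, Integer::sum);
            }
        }
        return frequencies;
    }

    /** Formats the counts as {field: term/count, term/count, ...}. */
    static String format(String field, Map<String, Integer> frequencies) {
        StringJoiner joiner = new StringJoiner(", ", "{" + field + ": ", "}");
        for (Map.Entry<String, Integer> entry : frequencies.entrySet()) {
            joiner.add(entry.getKey() + "/" + entry.getValue());
        }
        return joiner.toString();
    }

    public static void main(String[] args) {
        String text = "Welcome to the web; the web welcomes XML";
        System.out.println(format("content", termFrequencies(text)));
        // prints {content: the/2, to/1, web/2, welcome/1, welcomes/1, xml/1}
    }
}
```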
What on earth is a docnum?
In Lucene, the IndexReader can give you this term frequency vector if you know the document number. To get the document number of the nth hit, call:
int docnumber = hits.id(n);
It looks like the id is the document number.
Here is its term frequency vector
{content: 1/1, 1356/8, 2/1, 216.187.231.34/3, 216.187.231.34/akc/2, 3/1, 8080/1,
8080/akc/1, about/1, above/5, absolute/2, access/1, account/1, additional/1,
address/3, adress/1, advantage/2, advantageous/1, akc/9, aliases/1, all/1, also/2, any/3,
application/2, application1/1, application2/1, applications/1, approached/1, argument/4,
arguments/5, article/1, aspect/1, aspire/1, associate/1, assumes/1, available/2, background/1,
based/2, because/2, belongs/1, both/1, browser/6, called/5, came/1, can/9, care/1, case/2, change/3,
class/1, client/4, clients/1, comma/1, completely/1, consider/1, create/1, creating/2, deal/1,
decide/1, declare/1, definition/1, deliver/2, delivered/2, dependent/1, devlivering/1,
different/1, discussed/1, display/3, displayed/1, displaynotempurl/7, displayservlet/10, divided/1,
do/2, document/3, doesn't/2, don't/2, done/1, downloaded/1, dyanmic/1, earlier/1, either/1, equivalent/1,
especially/1, etc/2, ever/1, every/1, example/2, existing/1, explanation/1, explicitly/1, far/1, file/2,
filename/2, first/1, focuses/1, follow/2, following/3, follows/1, from/5, ftp/1, fully/1,
further/2, gets/1, given/1, google/1, guess/1, handed/1, hari/7, has/2, have/5, hiding/1, host/4,
host/application1/servlet/1, host/application2/servlet/1, house/1, how/3, html/1, http/8, i/2, id/1,
identified/1, identifier/6, identifies/3, identifying/1, inside/1, instruction/1, internal/1,
invocation/1, ip/2, its/1, java/6, just/1, keep/1, key/1, know/2, known/2, knows/1, komatineni/7,
let/2, lik/1, like/2, limit/1, linking/1, links/4, list/1, located/2, logic/2, logical/1, long/2,
look/1, lookup/1, machine/7, mail/1, maintain/1, maintains/1, make/1, mappings/1, master/2, may/1,
me/2, meaningful/1, means/2, methods/1, much/1, myservlet/2, name/5, names/3, need/1, needs/3,
new/3, next/1, notebook/1, notice/2, now/2, nuances/1, number/8, one/1, only/1, ordinary/1,
other/1, over/1, owner/1, owneruserid/1, page/12, pages/4, paint/1, pairs/1, parent/1, part/8,
particular/1, parts/1, path/2, people/1, points/1, port/10, ports/2, possible/1, practical/1,
prefix/2, primarily/1, process/1, properties/1, protcol/1, protocol/8, protocols/2, purpose/1,
really/2, refinement/1, relative/11, removed/1, report/1, reportid/1, request/1, reside/1, resource/2,
responsible/1, rest/2, returns/1, revisit/1, rewrite/1, rewritten/3, same/4, scheme/1, second/1, see/2,
sense/1, separate/1, separated/1, separator/1, server/11, servers/4, service/1, servlet/19, servlets/2,
several/1, short/1, shortening/1, side/1, similar/1, single/1, so/6, some/2, something/1, specific/1,
specified/2, specify/2, start/2, starts/1, static/1, stays/1, string/3, structure/1, sub/1, summary/2,
table/1, taking/1, tell/1, tells/1, them/1, think/1, through/2, two/1, type/1, understanding/2,
universal/1, up/1, uri/3, url/36, urls/7, use/1, user/2, uses/1, using/4, usually/3, value/1,
very/1, waiting/1, way/2, web/31, webapp/1, webpage/2, webserver/10, webservers/2, well/1,
what/8, when/5, where/2, which/1, while/1, won't/1, you/12, your/3}
Working with boolean queries: sample code
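import java.util.Iterator;
import java.util.List;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

// Builds one boolean query with an optional term clause for each word in wordList.
public static Query getRelevanceQuerySimple(List wordList)
{
    BooleanQuery bq = new BooleanQuery();
    // Each clause is neither required nor prohibited, so a document
    // can match on any of the words in its "content" field.
    boolean bNotRequired = false;
    boolean bNotProhibited = false;
    Iterator wordItr = wordList.iterator();
    while (wordItr.hasNext())
    {
        String word = (String) wordItr.next();
        // One term query per word
        TermQuery tq = new TermQuery(new Term("content", word));
        // add(query, required, prohibited), as in the Lucene 1.x API
        bq.add(tq, bNotRequired, bNotProhibited);
    }
    return bq;
}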
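One possible way to produce the wordList for getRelevanceQuerySimple is to take the most frequent terms from a document's term frequency vector. The sketch below does that in plain Java. It is only an assumption about how the two pieces could fit together, the names are hypothetical, and raw counts are a crude measure of importance (common words will dominate unless stop words are removed first).

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Picks the n most frequent terms from a term -> count map; names are hypothetical. */
class TopTermsSketch {
    static List<String> topTerms(Map<String, Integer> frequencies, int n) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(frequencies.entrySet());
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue()); // higher counts first
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey()); // ties: alphabetical
        });
        List<String> words = new ArrayList<>();
        for (int i = 0; i < entries.size() && i < n; i++) {
            words.add(entries.get(i).getKey());
        }
        return words;
    }

    public static void main(String[] args) {
        // A few counts taken from the second vector shown earlier in these notes.
        Map<String, Integer> frequencies = new HashMap<>();
        frequencies.put("url", 36);
        frequencies.put("web", 31);
        frequencies.put("servlet", 19);
        frequencies.put("page", 12);
        frequencies.put("you", 12);
        System.out.println(topTerms(frequencies, 3)); // prints [url, web, servlet]
    }
}
```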
See what a multi-term query and a fuzzy query can do
Can these be used for relevance search? Check the mailing list. Check the book.
Finding similar documents
Look at the sandbox code
Contents
MoreLikeThis.java
SimilarityQueries.java
These seem to have been written by Doug.
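I have not checked how those two classes decide that documents are similar. As a point of comparison, a common textbook measure is the cosine of the angle between two term frequency vectors. The plain-Java sketch below shows that measure only. It is not taken from the sandbox code, and the names are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

/** Cosine similarity between two term -> count maps; not taken from the Lucene sandbox. */
class CosineSketch {
    /** Returns a value between 0 (no shared terms) and 1 (same direction). */
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0;
        for (Map.Entry<String, Integer> entry : a.entrySet()) {
            Integer other = b.get(entry.getKey());
            if (other != null) {
                dot += entry.getValue() * other; // only shared terms contribute
            }
        }
        double norms = norm(a) * norm(b);
        return norms == 0 ? 0 : dot / norms; // an empty vector is similar to nothing
    }

    /** Euclidean length of the vector. */
    static double norm(Map<String, Integer> vector) {
        double sumOfSquares = 0;
        for (int count : vector.values()) {
            sumOfSquares += (double) count * count;
        }
        return Math.sqrt(sumOfSquares);
    }

    public static void main(String[] args) {
        Map<String, Integer> first = new HashMap<>();
        first.put("web", 2);
        first.put("url", 1);
        Map<String, Integer> second = new HashMap<>();
        second.put("web", 1);
        second.put("xml", 1);
        System.out.println(cosine(first, second)); // 2 / (sqrt(5) * sqrt(2))
    }
}
```

Documents with a higher cosine share more of their vocabulary in similar proportions, so ranking stored documents by this value against a given document is one way to approach the "find similar documents" problem.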
How to enable lucene for storing term frequency vectors
When the index is built, you need to do something special if you want to keep the term frequency vector for a document.
When you add an indexed text field to the document, there is a boolean argument that you need to set to true. Example:
Field.Text(x,y,true);
See the API for the Field.Text method