
Enhancement Details

Enhance the DNN search engine

Enhance the DNN search engine by adding support for indexing PDF and MS Word documents. This could be extended by adding crawling functionality to index non-DNN content (e.g. HTML, text).

Problem:
The current DNN implementation is limited to indexing modules that implement the ISearchable interface. This excludes static content such as PDF and MS Word documents, HTML, or text files that would benefit from indexing.

Rationale:
Many sites include references to static documents such as PDFs that should be included in search results. The current DNN search implementation will not index this content.

Solution:
The Apache Lucene project includes a .NET implementation that provides crawling functionality that includes PDFs and HTML (amongst others). This can be leveraged in the DNN search project.

Impact:
The proposed solution is a crawler. The current DNN solution indexes DNN module content marked as indexable. This change would likely have a moderate to high impact on the way indexes are created and stored. The impact would be limited to search indexing and search result references.

Risk:

Created: 6/28/2007 7:22:14 AM by Richard Dorman
Scheduled For Version:
Delivered In Version:




Comments


 Jan Meffert
11/4/2010 6:23:27 PM
DNN Search is one of the weakest points of the framework. Any modern site with significant content relies heavily on search.
 Antonio Rizzelli
5/20/2010 5:41:24 AM
I agree with Thomas Jensen: Lucene.NET would be the best choice. A scheduled approach to indexing should also be considered, since extracting index information from PDF and Word documents can take a while.
 aldeng
4/2/2009 9:56:06 PM
Tony and Fabrice are right on. I would like to be able to limit search to a module, so module developers no longer have to implement search at the module level every time using different strategies. Inline suggestions would also be great. It should also use a full-text catalog in the database, which would be much faster. And I second Tony's idea of APIs for CRUD operations. Great idea.
 Fabrice
2/23/2009 9:59:26 PM
DNN Search is one of the weakest points of the framework. Any modern site with significant content relies heavily on search. Here are a few features to look for:
- Index/search at the module level (module awareness)
- Creation of configurable tag clouds
- Context-sensitive search
- Advanced search
- Inline suggestions
- Index non-HTML content
- Search for identical words or word variations

I second Tony's suggestion.
 Tony Valenti
1/9/2009 9:40:21 AM
I think that the search engine implementation should be rethought. It makes sense that a search engine website such as Google would scan the internet and look for changes, because Google is not in control of the entire internet's content. However, I believe that an application framework should index content "On Insert", re-index content "On Update", and purge content "On Delete". What I mean by this is that instead of having a scheduled task that scans for changes, there should be a SearchEngine.Add/Update/Delete method that the module developers call. This will also decrease the load on the server, because only the content that needs to be indexed is indexed instead of the entire site.
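
A minimal sketch of the Add/Update/Delete idea described above, written in Python for brevity. The `SearchIndex` class and its method names are hypothetical illustrations, not part of the DNN API. It shows how a module could push changes into an index when content changes, rather than waiting for a scheduled scan of the whole site.

```python
import re
from collections import defaultdict


class SearchIndex:
    """Hypothetical in-memory inverted index; not DNN's actual search API."""

    def __init__(self):
        self._postings = defaultdict(set)  # term -> ids of documents containing it
        self._doc_terms = {}               # document id -> set of its terms

    @staticmethod
    def _tokenize(text):
        # Lowercase and split on anything that is not a letter or digit.
        return set(re.findall(r"[a-z0-9]+", text.lower()))

    def add(self, doc_id, text):
        """Index new content ("On Insert")."""
        if doc_id in self._doc_terms:
            raise KeyError(f"{doc_id!r} is already indexed; call update()")
        terms = self._tokenize(text)
        self._doc_terms[doc_id] = terms
        for term in terms:
            self._postings[term].add(doc_id)

    def delete(self, doc_id):
        """Purge content ("On Delete"); unknown ids are ignored."""
        for term in self._doc_terms.pop(doc_id, set()):
            self._postings[term].discard(doc_id)
            if not self._postings[term]:
                del self._postings[term]

    def update(self, doc_id, text):
        """Re-index changed content ("On Update")."""
        self.delete(doc_id)
        self.add(doc_id, text)

    def search(self, query):
        """Return ids of documents that contain every term in the query."""
        terms = self._tokenize(query)
        if not terms:
            return set()
        return set.intersection(*(self._postings.get(t, set()) for t in terms))
```

A module would call `add` when an item is created, `update` when it is edited, and `delete` when it is removed, so only the changed item is touched. A real implementation would also need persistence and concurrency control, which this sketch leaves out.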
 Justin Jovic
10/20/2008 8:56:52 PM
Also make it so that it doesn't time out with a critical error when it doesn't find something after a given time. I hate seeing that red critical error icon every time I venture to search.
 Thomas Jensen
9/1/2008 1:39:54 PM
I don't get it - "the proposed solution is a crawler." Why not just have .NET Lucene crawl your site, and make a skin that uses Lucene for search? If anything, this is an option for module/skin developers, and should not be part of the core. The core search works fine searching against the DB instead of crawling.
 Bryan Beswick
2/28/2008 10:29:54 AM
This module looks fantastic as it is ... except that it can't search the CONTENTS of the files it stores. Once that is implemented, I'd be hard-pressed to think of anything it is lacking. In fact, with the addition of a few fields such as "body" it might be a great basis for an article manager.
 Joseph Sak
2/19/2008 3:25:54 PM
Also make it handle typos, misspellings, and non-literal search functions.
 M Bouwman
9/13/2007 4:33:24 AM
I think the search engine should indeed be reviewed; there are a lot of possible improvements to be made...
 dinodino
7/13/2007 10:25:00 AM
There should also be a way to exclude certain pages from the search. Maybe a checkbox under "Page Settings" to exclude a page from the search engine.