About Searching inside Uploaded Files

EPiServer uses both an open source library (Lucene) and Microsoft Indexing Service to create the search index for files.

In EPiServer 4, Microsoft Indexing Service was responsible for building an index for ordinary files (i.e. the upload folder), and the EPiServer Indexing Service was needed to get the versioned files in the documents folder indexed. EPiServer had to implement its own indexing for the documents folder because the path and file name are stored in the database, while the content is stored in a file with a GUID as its name.

In EPiServer CMS 5, all files are stored using the VirtualPathVersioningProvider (which always stores the path and file name in the database and the content as a file with a GUID as its name). For this reason, the EPiServer Indexing Service must be running, and the web site must be configured to use it, if you want to search files.
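
To make the storage layout concrete, here is a minimal sketch in Python. It is not EPiServer code: the class VersionedStore, its methods and the in-memory catalog (standing in for the database) are all hypothetical. It only illustrates why a crawler that looks at the folder on disk sees GUID-named files and cannot recover the original path or file name.

```python
import tempfile
import uuid
from pathlib import Path


class VersionedStore:
    """Toy model (hypothetical): names live in a catalog, content in GUID-named files."""

    def __init__(self, root):
        self.root = Path(root)
        self.catalog = {}  # stands in for the database table: virtual path -> GUID

    def save(self, virtual_path, content):
        guid = str(uuid.uuid4())
        (self.root / guid).write_bytes(content)  # the file on disk has no meaningful name
        self.catalog[virtual_path] = guid
        return guid

    def read(self, virtual_path):
        # The catalog lookup is the only way back from a name to the content.
        return (self.root / self.catalog[virtual_path]).read_bytes()


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        store = VersionedStore(tmp)
        store.save("/Documents/report.doc", b"quarterly figures")
        on_disk = [p.name for p in Path(tmp).iterdir()]
        return on_disk, store.read("/Documents/report.doc")
```

Calling demo() returns the list of names found on disk (a single GUID) together with the content read back through the catalog. An indexer that only crawls the folder would have the content but not the name "report.doc".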

So how do the keywords get into the index?

You cannot just index the raw content of a binary Word document or PDF file. The binary file must be converted to text first, and for this EPiServer relies on a part of Microsoft Indexing Service. Applications can register converters that implement a COM interface (IFilter), and these are used by Microsoft Indexing Service, SharePoint, EPiServer or any other application interested in getting the text out of a binary document.
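
The idea of registered converters can be sketched as a lookup table keyed on file extension. This is only an analogy written in Python: real IFilter implementations are COM components registered with Windows, and the names register_filter and extract_text, as well as the ".toy" format, are made up for illustration.

```python
from typing import Callable, Dict, Optional

Extractor = Callable[[bytes], str]

# Hypothetical registry: file extension -> function that turns bytes into text.
_filters: Dict[str, Extractor] = {}


def register_filter(extension: str, extractor: Extractor) -> None:
    """Register a converter for one file extension (case-insensitive)."""
    _filters[extension.lower()] = extractor


def extract_text(file_name: str, data: bytes) -> Optional[str]:
    """Return the text of a document, or None if no converter is registered."""
    dot = file_name.rfind(".")
    extension = file_name[dot:].lower() if dot != -1 else ""
    extractor = _filters.get(extension)
    return extractor(data) if extractor else None


# Plain text needs no real conversion.
register_filter(".txt", lambda data: data.decode("utf-8"))
# Made-up binary format: a 4-byte header followed by UTF-8 text.
register_filter(".toy", lambda data: data[4:].decode("utf-8"))
```

With this registry, extract_text("notes.TXT", b"hello") gives the text back, while a file type without a registered converter yields None, so the indexer has no keywords to add for it.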

You can have a look at the implementation with Lutz Roeder’s .NET Reflector if you load EPiServer.IndexingService.exe and look at the class EPiServer.IndexingService.Indexers.FileItemIndexer.

Do you want to create your own EPiServer File System?

A little more analysis reveals (at least in version 5.1.422) that the EPiServer Indexing Service does not ask the VirtualPathProvider class for the content of the file! Instead it has hardcoded knowledge of the physical location used by the VirtualPathVersioningProvider (see EPiServer.IndexingService.ItemIndexerManager.CreateDocument). This makes it impossible to create your own implementation of EPiServer’s Unified File System and have its files indexed correctly by the EPiServer Indexing Service.
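
The difference between the two approaches can be illustrated with a small Python sketch. It is not a reconstruction of EPiServer's code: DiskProvider, MemoryProvider and both index functions are hypothetical names. It just contrasts an indexer that goes through the provider abstraction with one that assumes a physical location.

```python
import tempfile
from pathlib import Path


class DiskProvider:
    """Stores content as files under a root folder (the layout the indexer assumes)."""

    def __init__(self, root):
        self.root = Path(root)

    def save(self, name, text):
        (self.root / name).write_text(text, encoding="utf-8")

    def read(self, name):
        return (self.root / name).read_text(encoding="utf-8")


class MemoryProvider:
    """A custom provider that keeps its content somewhere else entirely."""

    def __init__(self):
        self.items = {}

    def save(self, name, text):
        self.items[name] = text

    def read(self, name):
        return self.items[name]


def index_via_provider(provider, name):
    # Works with any provider, because it only relies on read().
    return provider.read(name).split()


def index_via_hardcoded_path(root, name):
    # Assumes the disk layout; finds nothing for content stored elsewhere.
    path = Path(root) / name
    return path.read_text(encoding="utf-8").split() if path.exists() else []


def demo():
    with tempfile.TemporaryDirectory() as root:
        disk, memory = DiskProvider(root), MemoryProvider()
        disk.save("a", "hello disk")
        memory.save("b", "hello memory")
        return (
            index_via_hardcoded_path(root, "a"),  # found: matches the assumed layout
            index_via_hardcoded_path(root, "b"),  # missed: content is not on disk
            index_via_provider(memory, "b"),      # found: asked the provider
        )
```

In this sketch the hardcoded indexer finds the words of file "a" but returns an empty list for "b", which is the situation a custom file system would be in.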

See also: Storing metadata attached to uploaded files and EPiServer Tech Notes.

  1. Per Bjurström

    Nice article. The EPiServer Indexing Service is built to support the versioning file system and not to be a generic virtual path indexer; because of that, it can make some assumptions about the data it’s indexing. So, to implement your own file system you have to implement search yourself, or at least do the plumbing to your own search engine of preference.

  2. Steve Celius

    Nice blog entry!

    Too bad that the indexer won’t support non-versioning systems. That makes it kind of a non-option to build your own file systems for customers that use the built-in search.

  3. steven

    I have been investigating the search as discussed above, but on an x64 machine.
    I have installed the x64 IFilter supplied by Adobe. I have run the indexing service in debug mode, and it seems to load the files into the index fine, except for .doc files. (For .doc I get a message that the file could not be added to the index.)
    I am also using the versioning provider.
    The file manager search in EPiServer is not returning files, not even for .txt files
    as this article indicates: http://labs.episerver.com/en/Blogs/Mari-Jorgensen/Dates/2009/11/Searching-for-files-in-EPiServer-CMS-5/

    Please advise how to query the Web catalog for verification purposes on dev,
    which VPP provider to use, etc.
    Regards
    Steven

    1. Fredrik Haglund

      Hi Steven!

      I’m on vacation and cannot help you. Please direct your question to EPiServer Support at http://world.episerver.com/Support/
