Author Archives: steph

505(b)(2) Drug Discovery with dedicated Search Engines

505(b)(1) vs. 505(b)(2) New Drug Application (NDA) Routes

New drugs whose active ingredients have not previously been approved, follow the 501(b)(1) route. This consumes billions of dollars, take years to complete, and requires extensive clinical trials.

Should a new drug contain similar active ingredients as a previously approved drug, the NDA applicant can in many cases rely on previous FDA findings. This results in a potentially substantive shorter approval process, resulting in lower cost.


The table below shows one example of a successful 505(b)(2) NDA. The drug Bendeka(R), based on the original approvel of Treanda(R). Note the period from submission to approval.

(from Seven Noteworthy 505(b)(2) Submissions, Charles O. Jaap, V, MBA, RAC, Jodi Hutchins, RAC, CQA and Mikel Alberdi, MPH, RAC)

We submitted a proposal to a 501(b)(2) consultancy, and our initial two-week exploration resulted in some interesting findings.

We examined some drug information data sources, some of which are listed below:

Drugs@FDA: Provides product specific regulator history 
FDA Orange Book: lists each product and their associated patents and exclusivity
FDA IID (inactive ingredient search)
RightFind (copyright search)
PubMed: – Citations data source
FDA Forms application forms etc
Hein Online laws?
21 CFR Search code of fed regs
DailyMed drug guidance, regs, labeling
Toxnet toxicity dbs
USP-NF guide through pharma process
REMS@FDA approved risk evaluation and mitigation
Health Canada
European Medicines Agency agency curl=/pages/medicines/landing/epar_search.jsp&mid=WC0b01ac058001d124 
FDA Drug Label Database

Our initial plan was to utilize as many APIs of the above data sources as possible. However, licensing costs and complexity (there is no standard API!) put paid to that idea.

The examples below are an import from the DrugBank  and the RxNorm downloads. Several database downloads were found to have had recent schema changes and data inconsistencies. In the interest of time, we decided to scrape the individual sites and not be the best netizens.



Google Custom Search

Google Custom Search is a very useful product! We decided to employ GCS as an interim search engine to aggregate drug documents to ingest into ElasticSearch. It has to be mentioned that it has certain drawbacks.


Watson Retrieve and Rank

We are using Watson to assist in curating the downloaded documents. This adds further ranking to individual documents.  Watson R&R definitely warrants further investigation.

Prior to Watson R&R being useful, the engine has to be trained on an uploaded document corpus. A sample of simplified training questions are listed below:

what are the active ingredients for solu-cortef?
what is the active NDA number for solu-cortef?
what is the approval date for solu-cortef?
what are the active ingredients for hydrocort?
what is the active NDA number for hydrocort?
what is the approval approval date for hydrocort?
when was the NDA issued for Nicorette?
what is an alternative name for Nicorette?
what are the active incredients for Nicorette?
what are the active ingredients for Vasotec?
what is the active NDA number for Vasotec?
what is the approval date for Vasotec?
where there any incomplete responses for Vasotec?


We further employ manual document tagging for custom curation:


Once our documents have been ingested into ElasticSearch, we perform clustering:


The pharmaceutical regulatory field is fertile as far as Data Science, Data Engineering, and especially Document Search is concerned. We hope to develop this into a commercial offering in the near future.

Please feel free to contact Steph van Schalkwyk with commercial requests. 

Steph van Schalkwyk 
314 452 2896 (direct)

Using open source Machine Learning packages to augment SharePoint Search results – DRAFT

We have been using external code to augment our SharePoint Search crawl pipeline discovery since about 2011.

Not only does SharePoint allow a simple method to link into the crawling pipeline, it also allows for automated schema addition and re-crawl if one adds a bit of our amazing code.

We use SQL Server to store interim results of discovered topics if the topic extraction code is overloaded. Those data in SQL Server are used to mark the affected documents for re-crawl so that the newly-discovered topics can be added to the search index.

DreamReport Manufacturing Report integration with SharePoint


SharePoint If your company is using SharePoint, then you already know that document management and retrieval is a critical component of a well-oiled enterprise.  When using a powerful document generator like Dream Report, you’ll want to make sure your automatic and manually created reports are managed consistently, along with your other corporate information.

While Dream Report could always be used with Microsoft SharePoint for basic document management, the Dream Report Document Management Extension adds a new level of indexing and document retrieval.  With the Extension installed, you’ll be able to index documents based on key parameters that you identify.

Basic document management is the ability to store and retrieve files.  These files are indexed by file name and date/time attributes.  Storing and retrieving Dream Report documents such as PDF files, Excel files and CSV files is what SharePoint does very well.  Dream Report will deliver documents to your folder of choice, either using your local Windows file system (with required authentications), or to remote locations through the use of File Transfer Protocol (FTP).

Dream Report Document Transfer to:

  • Windows Directories with Authentication
  • FTP File Transfers with Authentication
  • Place documents into select SharePoint Document Libraries

Dream Report Document Management Extension for Microsoft SharePoint

The Document Management Extension delivers the new ability to select variables from with-in Dream Report documents, and purposely flag them for SharePoint search engine indexing.  The Document Management Extension will monitor document directories and wait for new deliveries.  Upon delivery, the Extension will index the parameters list, storing the results in the SharePoint Search Engine index.  Users will then use the SharePoint Search Engine to analyze a query and return the Dream Reports of interest.

DR SharePoint 1
<br\>Typical queries that users will benefit from:

  • Operator Names
  • Equipment Designations
  • Product Serial Numbers
  • Pass / Fail Designations
  • Variable Results > = <
  • Date / Time
  • More – defined by the user

If Dream Report documents are removed from the directory, the Document Management Extension will also remove those documents from the SharePoint Search Engine.


Installation, Maintenance and Support

The Dream Report Document Management Extension for SharePoint is developed and supported by Ocean Data System’s partner, REMCAM LLC, experts in search engines and Microsoft SharePoint customization and support.

The Document Management Extension is delivered as a REMCAM service.  They will assist your SharePoint administrator in the installation and support of the extension.  Once installed, the Document Management Extension will run automatically and will not require administrative support.  For more information, contact USA 312-281-8982 or

Visualizing SQL Server Data with ElasticSearch, LogStash and Kibana (ELK Stack)

I was recently asked about visualizing data contained in SQL Server, using the ELK Stack. While the client eventually opted to use a local developer, I decided to quickly throw an article together.







I grabbed some sample CSV data from the SpatialKey website here: (, specifically the Sales Transactions sample, and proceeded to load it into SQL Server 2016. While I’ll be the first to admit that this is a contrived sample, the potential client called for the data to be sourced from SQL Server.

SQL Server – create a table:

Create Table [SalesTransactions2009].dbo.[Sales]
Transaction_date DATE,
Product NVARCHAR(50),
Price DECIMAL(12,2),
Payment_Type NVARCHAR(50),
[Name] NVARCHAR(50),
City NVARCHAR(50),
State NVARCHAR(2),
Country NVARCHAR(50),
Account_Created DATETIME,
Last_Login DATETIME,
Latitude DECIMAL(12, 6),
Longitude DECIMAL(12,6)

and import these data using SQL Bulk Insert:

BULK INSERT SalesJan2009 FROM 'C:\SalesJan2009\SalesJan2009.csv' WITH  (
     FIRSTROW = 2,
     FIELDTERMINATOR = ',',  --CSV field delimiter     
     ROWTERMINATOR = '\n',   --Use to shift the control to next row     
     ERRORFILE = 'C:\SalesJan2009\ErrorRows.csv',     
     TABLOCK     )

Preparing and exporting data from SQL Server to ingest into the ELK stack:

There are three options (at least what I can think of, off-hand):

  1. LogStash
  2. JSON into ElasticSearch
  3. FileBeat

In the interest of brevity, I’m only showing the LogStash option here and will leave the others as an exercise for the reader.

Export from SQL Server:

Create two SPs, one to export to CSV (for LogStash) and the other to export to JSON (for direct ElasticSearch ingestion). There are a myriad examples on the web. If you have an SQL Server version < 2016, one can structure an XML export to resemble JSON.

Create a scheduled export (a number of ways exist), and let it create a CSV file. I realize that I could just have used the original CSV file, but that would have me not use SQL Server.

Create the ElasticSearch index:

  "settings": { 
      "number_of_shards" : 1 
  "mappings" : {  
     "type" : { 
        "properties" : { 
          "location" : {  
             "type" : "geo_point" 
           "Transaction_date" : {"type" : "date" , "format" : "MM/dd/YY HH:mm" }, 
           "Account_Created" : { "type" : "date" , "format" : "MM/dd/YY HH:mm"}, 
           "Last_Login" : { "type" : "date" , "format" : "MM/dd/YY HH:mm"} 
} } } }

Then create the logstash.conf file:

input { 
  file { 
   path => "C:/sales_jan2009/SalesJan2009.csv" 
   start_position => "beginning" 
   sincedb_path => "C:/sales_jan2009/sinceDb" }  
 filter { 
  csv { 
     separator => "," 
     columns => ["Transaction_date","Product","Price","Payment_Type","Name","City","State","Country","Account_Created","Last_Login","Latitude","Longitude"] 
     remove_field => ["message"] } 
     mutate { 
        convert => { "Longitude" => "float" } 
        convert => { "Latitude" => "float" } 
        convert => { "Price" => "float" } 
     mutate {  . . . 

(Remainder left as an exercise…)

Making sure to have LogStash “watch” the data file for changes (using sincedb), one can then start ElasticSearch, Kibana and LogStash.

If one is lucky, the LogStash log will have entries such as this:

 [2017-06-06T15:21:18,342][DEBUG][logstash.pipeline ] output received { 
 "event"=>{"Product"=>"Product1", "Latitude"=>47.60639, "City"=>"Seattle ", 
 "Account_Created"=>"1/8/09 18:59", "Payment_Type"=>"Visa", 
 "Longitude"=>-122.33083, "Name"=>"Reto", "
 "@timestamp"=>2017-06-06T20:21:15.247Z, "Price"=>1200.0, 
 "Last_Login"=>"2/18/09 15:51", "State"=>"WA", "@version"=>"1", 
 "host"=>"SP01", "Country"=>"United States", 
 "Transaction_date"=>"1/9/09 15:03"}}

Switching to Kibana, one can then create individual visualizations and place these onto a dashboard.

Feel free to ping me with commercial requests. I work on Windows, Linux and occasionally other platforms. My forte is Enterprise Search (ElasticSearch, SOLR, SharePoint), Big Data, and SharePoint.

Steph van Schalkwyk

Using Packer and Vagrant to install your SharePoint environments on Windows and Linux!

Ok, time to update this.
I run all my development environments (SharePoint, SOLR, ElasticSearch, SCALA) in containers, either Docker, VMWare or VirtualBox. Doing it this way has allowed my hosts (Ubuntu on a 24 core Intel Server Platform) to remain stable, and not to interfere with development time. Too much time is wasted debugging platforms!
The steps below are applicable to both a Windows host as well as a Linux host.
1. Download and install Packer:
2. Download and install Packer-Windows: . You can either grab the zip file or do #git clone
3. In the downloaded Packer-Windows folder, you’ll find a number of *.json files. For SharePoint, copy the windows-2012-r2.json file. Name it anything that makes sense to you.
4. This is the fun part. Edit the json file:
     a. The first part is the builders. Here you can choose which builders to use. I use both VMWare and VirtiualBox. For SharePoint, you cannot use Docker, for obvious reasons.
          i. My build adds another CD Drive to the machine, which allows me to run an unattended Windows Update from the CD Drive, as opposed to the tedious process of updating from the web. Note that the second CD has to have the full path to the folder specified. Relative paths are only supported on the first CD drive in Packer.
          ii. These also show the machine screen while building (“headless”:”false”).
          iii. The floppy files entries are pretty standard
  "builders": [
             "type": "vmware-iso",
             "iso_url": "./iso/9200.16384.WIN8_RTM.120725-1247_X64FRE_SERVER_EVAL_EN-US-HRM_SSS_X64FREE_EN-US_DV5_UPDATED.ISO",
             "iso_checksum_type": "md5",
             "iso_checksum": "5ef4590ebc4f9c4ff230e504e1cf236e",
             "headless": false,
             "boot_wait": "2m",
             "ssh_username": "vagrant",
             "ssh_password": "vagrant",
             "ssh_wait_timeout": "6h",
             "shutdown_command": "shutdown /s /t 10 /f /d p:4:1 /c \"Packer Shutdown\"",
             "guest_os_type": "windows8srv-64",
             "tools_upload_flavor": "windows",
             "disk_size": 61440,
             "vnc_port_min": 5900,
             "vnc_port_max": 5980,
             "skip_compaction": "true",
             "floppy_files": [
             "vmx_data": {
                     "ide1:1.devicetype": "cdrom-image",
                     "ide1:1.filename": "/home/svanschalkwyk/Projects/packer-windows/iso/wsusoffline-w63-x64.iso",
                     "ide1:1.present": "true",               
                     "RemoteDisplay.vnc.enabled": "false",
                     "RemoteDisplay.vnc.port": "5900",
                     "memsize": "8192",
                     "numvcpus": "2",
                     "scsi0.virtualDev": "lsisas1068"
            "type": "virtualbox-iso",
               "iso_url": "./iso/9200.16384.WIN8_RTM.120725-1247_X64FRE_SERVER_EVAL_EN-US-HRM_SSS_X64FREE_EN-US_DV5_UPDATED.ISO",
               "iso_checksum_type": "md5",
               "iso_checksum": "5ef4590ebc4f9c4ff230e504e1cf236e",
               "headless": false,
                "boot_wait": "2m",
                "ssh_username": "vagrant",
                "ssh_password": "vagrant",
                 "ssh_wait_timeout": "6h",
                 "shutdown_command": "shutdown /s /t 10 /f /d p:4:1 /c \"Packer Shutdown\"",
                 "guest_os_type": "Windows2012_64",
                  "disk_size": 61440,
                  "floppy_files": [
       "vboxmanage": [
                "Sata Controller",
                "IDE Controller",
5. You can now edit the provisioners section. This section takes care of all the post-windows-installation steps. In addition to the entries below, I also install Chocolatey as well as 7zip, notepad++ and StartMenu8 (for those of us who don’t want to spend time swiping the left hand of the screen for “charms”…)
"provisioners": [
 "type": "shell",
 "remote_path": "/tmp/script.bat",
 "execute_command": "{{.Vars}} cmd /c C:/Windows/Temp/script.bat",
 "scripts": [
 "type": "shell",
 "inline": [
 "rm -rf /tmp/*"
 "post-processors": [
 "type": "vagrant",
 "keep_input_artifact": true
6. Find the related Vagrant file. In this case it is vagrantfile-windows_2012_r2.template. If you have anything you want to override, you can do it in this file.
I usually add two more forwarded ports: "forwarded_port", guest: 80, host: 8080, id: "iis-defaultsite" "private_network", ip: ""

and allow for a gui:

     v.gui = true
All the other entries are usually the same.
7. Find the appropriate answer file in the answer_file folder. For brevity, I won’t show it here. Two of the important sections are “< !– WITHOUT WINDOWS UPDATES — >” and “< !– WITH WINDOWS UPDATES — >”. Select the one which you prefer to have. You can also enter your Windows license here.
8. I also use NTLite ( to edit the installation. I haven’t been able to successfully slipstream my installation disk with NTLight though. Your experience may differ.
9. Select the version of Windows 2012-R2 you want to install:
          Name : Windows Server 2012 R2 SERVERSTANDARDCORE
          Description : Windows Server 2012 R2 SERVERSTANDARDCORE
          Name : Windows Server 2012 R2 SERVERSTANDARD
          Description : Windows Server 2012 R2 SERVERSTANDARD
          Name : Windows Server 2012 R2 SERVERDATACENTERCORE
          Description : Windows Server 2012 R2 SERVERDATACENTERCORE
          Name : Windows Server 2012 R2 SERVERDATACENTER
          Description : Windows Server 2012 R2 SERVERDATACENTER

10.  I usually add to additional forwarded ports to the Vagrant file. If you’re not going to use the final Vagrant step (see below), these can be added to the packer configuration: "forwarded_port", guest: 80, host: 8080, id: "iis-defaultsite" "private_network", ip: ""
11.  Now for your custom scripts. These are typically in the scripts folder.
I have two for my Chocolatey installs:
     a. chocolatey.bat"

          @powershell -NoProfile -ExecutionPolicy Bypass -Command "iex ((new-object net.webclient).DownloadString(''))" && SET PATH=%PATH%;%ALLUSERSPROFILE%\chocolatey\bin
     b. chocolatey_packs.bat:

          :: Ensure C:\Chocolatey\bin is on the path
           :: Install all the things; for example:
           cmd /c %ALLUSERSPROFILE%\chocolatey\bin\choco install 7zip -y
           cmd /c %ALLUSERSPROFILE%\chocolatey\bin\choco install notepadplusplus -y
           cmd /c %ALLUSERSPROFILE%\chocolatey\bin\choco install startmenu8 -y
Do make sure the network is available when these run as Chocolatey downloads from the web.
12. You can now add any other post-windows installation scripts you need.
13. Right. Are you ready to run Packer on this?
First, validate your json file for correctness.
The command is
c:\your dir\packer validate your_windows_2012_r2.json or $/your dir/packer validate your_windows_2012_r2.json
14. If that succeeds, you can build the boxes.
To select only one type of machine (VMWare or VirtualBox), you can specify the builder to use.
packer build -only vmware-iso your_windows_2012_r2.json 
15. In my build, I reqested the post-processors to keep the virtual machine files after building, so the virtual machines will be in the folders
     output-vmware-iso and 

In my case it makes better sense to keep the virtual machines and not go through the additional step of having to run Vagrant afterwards to provision them. Your needs may differ.
16. Configuring SharePoint. I use AutoSPInstaller and a number of script files to automatically configure SP. This is for another post.
Good luck!

Adding CD Drives to Packer Virtual Machines

QuickFix: Add additional CD Drives to Packer Machines

Add the following top section:
"./scripts/oracle-cert.cer" --- existing command
      "vboxmanage": [
         [ "modifyvm", "{{.Name}}", "--memory", "8192" ],
         [ "modifyvm", "{{.Name}}", "--cpus", "2" ],
           [ "storagectl", "{{.Name}}", "--name", "Sata Controller", "--add", "sata", "--controller",      "IntelAHCI" ],
           [ "storageattach", "{{.Name}}", "--storagectl", "IDE Controller", "--port", "1", "--device",
           "0", "--type", "dvddrive", "--medium",
                    "./scripts/openssh.ps1" --- existing command 
             "vmx_data": {
                     "ide1:1.devicetype": "cdrom-image",
                     "ide1:1.filename": "/home/svanschalkwyk/Projects/packer-windows/iso/wsusoffline-w63-x64.iso",
                     "ide1:1.present": "true",               
                     "RemoteDisplay.vnc.enabled": "false",
                 "RemoteDisplay.vnc.port": "5900",
                 "memsize": "8192",
                 "numvcpus": "2",
                 "scsi0.virtualDev": "lsisas1068"
Hope this helps someone.