In the previous post, Generating a Large PDF from Website Contents - HTML to PDF, Bookmarks and Handling Empty Pages, we saw how to generate a PDF from HTML and add bookmarks to the generated PDF files. Each generated PDF file covers an individual section and contains that section’s content and its bookmarks. These files now need to be merged into a single PDF file.
One of the important things to keep intact when merging is the document hierarchy. The Sections, Sub-Categories, and Categories should align correctly so that the final bookmark tree and the Table of Contents appear correctly. It is best to maintain the list of individual PDF document streams in the same hierarchy as required. Since we know the required structure right from the UI, this can be easily achieved by using a tree-like data structure that mirrors it.
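A minimal sketch of such a tree, written in Python for brevity rather than the C# used in the project. The fields of `DocumentSection` and the `walk` helper are assumptions made for illustration and are not the original listing.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Optional


@dataclass
class DocumentSection:
    """One node of the document tree (hypothetical fields)."""
    title: str
    pdf_bytes: Optional[bytes] = None  # generated PDF for this node, if any
    children: List["DocumentSection"] = field(default_factory=list)

    def walk(self, depth: int = 0) -> Iterator[tuple]:
        """Yield (depth, section) in document order, parent before children."""
        yield depth, self
        for child in self.children:
            yield from child.walk(depth + 1)


# Placeholder bytes stand in for real PDF streams.
root = DocumentSection("Report", children=[
    DocumentSection("Category A", children=[
        DocumentSection("Sub-Category A1", children=[
            DocumentSection("Section 1", pdf_bytes=b"pdf-a1-1"),
            DocumentSection("Section 2", pdf_bytes=b"pdf-a1-2"),
        ]),
    ]),
    DocumentSection("Category B", children=[
        DocumentSection("Section 1", pdf_bytes=b"pdf-b-1"),
    ]),
])

# Print the hierarchy with indentation that reflects the depth.
for depth, section in root.walk():
    print("  " * depth + section.title)
```

Walking the tree parent-first gives the order in which the PDF streams should be merged, so the merged pages and the bookmark tree stay aligned.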
This structure lets us maintain the document as a tree. It mirrors the structure presented to the user when selecting the PDF options. I used the iTextSharp library to merge PDF documents. To interact with a PDF, we first need to create a PdfReader object from the stream. Using the SimpleBookmark class, we can get the existing bookmarks for the PDF.
iText’s representation of bookmarks is a bit complex. It represents them as an ArrayList of Hashtables. Each Hashtable has keys like Action, Title, Page, Kids, etc. The Kids entry represents child bookmarks and is of the same ArrayList type. Since it was hard to work with this structure, I created a wrapper class to interact easily with the bookmarks.
Recursively iterating through the list of DocumentSections, I add all the bookmarks to a root Bookmark object. The root bookmark represents the full bookmark tree of the PDF file. The PageNumber of each bookmark is offset using a counter variable. The counter is incremented by the number of pages in each PDF section (pdfReader.NumberOfPages) as it gets merged into the bookmark root. This ensures that each bookmark points to the correct page in the combined PDF file.
The individual documents are then merged by iterating through all the generated document sections. Once done, we get the final PDF as a byte array, which is returned to the user.
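The offset logic can be sketched without any PDF library. The Python below is an illustration only: the dictionaries mimic the Title/Page/Kids shape described above, the assumption that a Page value starts with the 1-based page number (for example "3 Fit") is mine, and `Bookmark`, `to_bookmark`, and `merge_bookmarks` are hypothetical names. It also flattens the section tree into a simple ordered list.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Bookmark:
    """Friendlier wrapper around one raw bookmark entry."""
    title: str
    page_number: int
    children: List["Bookmark"] = field(default_factory=list)


def to_bookmark(raw: Dict, page_offset: int) -> Bookmark:
    # Assumption: "Page" begins with the 1-based page number, e.g. "3 Fit".
    page = int(str(raw["Page"]).split()[0])
    kids = [to_bookmark(kid, page_offset) for kid in raw.get("Kids", [])]
    return Bookmark(raw["Title"], page + page_offset, kids)


def merge_bookmarks(sections) -> Bookmark:
    """sections: iterable of (raw_bookmarks, number_of_pages) in merge order."""
    root = Bookmark("root", 0)
    counter = 0  # pages already placed in front of the current section
    for raw_bookmarks, number_of_pages in sections:
        for raw in raw_bookmarks:
            root.children.append(to_bookmark(raw, counter))
        counter += number_of_pages
    return root


section_a = ([{"Title": "Intro", "Page": "1 Fit",
               "Kids": [{"Title": "Scope", "Page": "2 Fit"}]}], 3)
section_b = ([{"Title": "Details", "Page": "1 Fit"}], 5)

merged = merge_bookmarks([section_a, section_b])
```

Here the first section has three pages, so the bookmark on page 1 of the second section ends up pointing at page 4 of the combined file.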
To generate a Table of Contents (ToC), we can use the root bookmark information. We need to manually create a PDF page, read the bookmark text, and add links to the page with the required font and styling. iText provides APIs to create custom PDF pages.
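As a rough illustration of deriving ToC rows from the bookmark tree, here is a library-free Python sketch. The `Bookmark` class, `toc_entries`, and `render_toc` are hypothetical names, and the `toc_pages` parameter reflects the assumption that the ToC is inserted at the front of the document, which moves every target page by the number of ToC pages. Drawing the actual page is left to the PDF library.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Bookmark:
    title: str
    page_number: int
    children: List["Bookmark"] = field(default_factory=list)


def toc_entries(bookmark: Bookmark, level: int = 0,
                toc_pages: int = 0) -> List[Tuple[int, str, int]]:
    """Return (level, title, page) rows in reading order.

    toc_pages is the number of ToC pages placed before the content.
    """
    rows = []
    for child in bookmark.children:
        rows.append((level, child.title, child.page_number + toc_pages))
        rows.extend(toc_entries(child, level + 1, toc_pages))
    return rows


def render_toc(rows) -> List[str]:
    """Plain-text rendering; a real ToC page would add links and styling."""
    return ["  " * level + title + " ... " + str(page)
            for level, title, page in rows]


root = Bookmark("root", 0, [
    Bookmark("Intro", 1, [Bookmark("Scope", 2)]),
    Bookmark("Details", 4),
])

rows = toc_entries(root, toc_pages=1)
for line in render_toc(rows):
    print(line)
```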
We are now able to generate a single PDF based on the website contents.
Very often I need to sign forms, receipts, and invoices in PDF format and send them across to someone else. Printing the PDF, signing it physically, and scanning it back (of course using Office Lens) is how I used to do this until a while back. Since I don’t have a printer at home, I always had to wait until I reached the office. Also, I did not like wasting paper and ink just to put a signature.
Adobe PDF reader allows us to ‘Fill and Sign’ documents. Using this option, we can add signatures without needing to print the documents. Follow the steps below to set up your Adobe Reader to sign any document.
1. Sign on white paper and take a picture. Upload the picture onto your computer and crop the image using your favorite image editor. You should end up with an image containing just your signature.
2. Open the PDF file that you need to sign with Adobe Reader.
3. Open the ‘Fill and Sign’ option. You can do this either from the ‘Tools Pane’ (Shift + F4 on Windows) or the menu ‘View -> Tools -> Fill and Sign.’
4. Under the Sign option, you can choose a signature image. Choose the image you created before and save.
You are all set to sign documents now. Anytime you want to sign a document, choose ‘Fill and Sign’ and you will see your signature under the Sign button. Click the signature and place it anywhere on the document that you want to sign. No more printing and scanning them back again.
Scanning physical documents with a scanner can be cumbersome, especially if you do not have easy access to one. Taking pictures with the default camera application on the phone might not give the results you expect. You will also usually end up needing to trim unwanted elements from such photos.
Microsoft Office Lens is the perfect application for scanning documents and whiteboards. Office Lens focuses documents in the camera frame and allows you to capture just what is required. It enhances the selected document sections. Below is an example of the highlight and the captured document.
- Capture and crop a picture of a whiteboard or blackboard, and share your meeting notes with colleagues.
- Make digital copies of your printed documents, business cards or posters, and trim them precisely.
- Printed text is automatically recognized (using OCR) when converting to Word and PDF, so you can search for words in images and copy and edit them.
‘I need some undistracted time.’
This was one of the things that came up in my team’s retrospective yesterday. Having some undistracted time is necessary for getting things done. It’s a good practice to have a consensus among the team members on how to manage disruptions and indicate whether you are open for a chat.
The Headphone Rule is an interesting way to indicate whether a person is open to interactions or not.
- No headphones: you can talk to me.
- 1 headphone: you can talk to me about work.
- 2 headphones: do not talk to me.
For people who do not use headphones, some other technique needs to be used (like sticky notes, colored lights, etc.). Luckily in my team, everyone uses headphones, and it was an acceptable solution. Irrespective of the way you choose, it is important to have some agreed way to indicate whether you are interruptible or not. It helps you and the team to have some undistracted time.
If you are a .NET developer looking for some awesome free stuff, then check out Visual Studio Dev Essentials. You get loads of free stuff.
Free tools, cloud services, and training
Get everything you need to build and deploy your app on any platform. With state-of-the-art tools, the power of the cloud, training, and support, it’s our most comprehensive free developer program ever.
Some of the key attractions of the program are:
- $300 Azure Credit for a year
- Access to Xamarin University Training
- Pluralsight access for three months
- WintellectNOW access for three months
All you need is a Windows Live ID to sign up. Get it if you have not already!
Last week was a busy one at NDC Sydney, and I was happy to be back there for the second time. The conference was three days long with 117 speakers, 37 technologies, and 151 talks. Some of the popular speakers were Scott Wlaschin, Scott Allen, Troy Hunt, Damian Edwards, Steve Sanderson, and a lot more.
Each talk is one hour long and eight talks happen at the same time. Below are the talks I attended:
- Keynote: Using EEG and Machine Learning to Perform Lie Detection
- A teams transition to Continuous Delivery
- Docker, FROM scratch
- The Technical Debt Prevention Clinic
- How to start and run a software lifestyle business
- Asynchronous Programming From The Ground Up
- Building Docker Applications with .NET - tooling, cross platform support and migration
- Hack Your Career
- Writing high performance code in .NET
- Growing Serverless code with Azure Functions and F#
- “The website’s down!” Stories and lessons on keeping your website up
- Self-Aware Applications: Automatic Production Monitoring
- Domain Modeling Made Functional
- Interactive C# Development with Roslyn
- Building Resilient Applications In Microsoft Azure
- Functional Design Patterns
- Logic vs. side effects: functional goodness you don’t hear about
- How one team built their first microservice
All sessions are recorded and are available here. The Sydney 2017 ones will soon be there. Overall it was a good event but did not match the one last year. Last year there were more of the popular speakers, and the talk content was also more interesting. Still, I am glad that NDC Sydney is happening, as it gives developers good exposure and networking opportunities. Thanks to Readify for sponsoring my tickets; it’s one of the good things about working with Readify.
Hope to see you next year as well!
We are typists first, and programmers second.
If you are like me, spending a lot of time with a computer, it’s worth taking the time to learn to type without looking at the keyboard, a.k.a. touch typing (if you currently type by the hunt and peck method). Though productivity cannot be measured by the number of words you type per minute, it’s good to learn to touch type. There are various applications that help you improve your typing speed. Some of them are online, and some are offline desktop applications. Pick one that suits you and improve your typing speed.
I learned to touch type only a couple of years back, getting inspired after reading Learn Anything in 20 Hours. In the book, the author explains how he learned a new keyboard layout, Colemak, in just 20 hours. The book explains the full process and setup used by the author. It uses various tools like Keyzen, Type-fu, and Amphetype. More than the tools, it is the process and the approach to learning that are interesting. A summary of the tools and approaches is available here, but I highly recommend reading the book. I found the approach very helpful and effective and used it to learn the QWERTY layout.
Irrespective of the way you choose to learn touch typing, it might seem a bit hard at the start. Keep at it for some time and you will soon see an improvement in your typing speed.
In the previous post, Generating a Large PDF from Website Contents, we saw from a high level the approach taken to generate PDF files from a Content Management System (CMS) website. In this post, we will delve further into the details of each of those areas.
HTML To PDF
There are a lot of libraries and services that support converting HTML to PDF. We chose this mechanism mainly for keeping the content formatting simple and reusable. Most of the PDF data was to be structured like the website content. This means we can reuse (read copy/paste) the HTML styling for the PDF content as well.
We used the Essential Objects HTML to PDF Converter library. Our website is hosted as an Azure Web App, and the Essential Objects library does not work in the Azure sandbox environment. The Azure sandbox restriction affects most of the HTML to PDF libraries. The recommended approach to use those libraries is to host the PDF conversion logic on an Azure Virtual Machine, which is what we also ended up doing. Alternatively, you can choose to use one of the hosted HTML to PDF services.
Converting an HTML URL endpoint to PDF uses the HtmlToPdf class from the EO.Pdf NuGet package. The HtmlToPdfOptions specifies various conversion and formatting options. You can set margin space, common headers, footers, etc. for the generated PDF content. It also provides extensibility points in the PDF conversion pipeline.
HTML Formatting Tip
You might want to avoid content such as images and charts being split across multiple pages. In these cases, you can use the page-break-* CSS properties to adjust page breaks. Essential Objects honors the page-break-* settings and adjusts the content when converting into PDF.
Bookmarks
A bookmark is a type of link with representative text in the Bookmarks panel in the navigation pane. Each bookmark goes to a different view or page in the document. Bookmarks are generated automatically during PDF creation from the table-of-contents entries of a document.
We generate a lot of small PDF files (per section and category/sub-category) and then merge them together to form the larger PDF. Each of the sections has one or more entries in the Table Of Contents (TOC). We decided to generate bookmarks first for each generated PDF. When merging the individual PDFs, the bookmarks are merged first, and then the TOC is created from the full bookmark tree.
Bookmarks can be created automatically or manually using the Essential Objects library. Most of the other libraries also provide similar functionality. Using the AutoBookmark property, we can have bookmarks created automatically based on HTML header (H1-H6) elements. If this does not fit your scenario, then you can create them manually. In our case, we insert hidden HTML tags to specify bookmarks. Bookmark hierarchy is represented using custom attributes on those tags.
Once the PDF is created from the URL, we parse the HTML content for elements with the bookmark class and manually add the bookmarks into the generated PDF. The GetElementsByClassName and the CreateBookmark methods help us to create bookmarks from the hidden HTML elements in the page.
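To illustrate how hidden tags with custom attributes can describe a bookmark hierarchy, here is a small Python sketch using only the standard library. The `data-level` and `data-title` attribute names and the `BookmarkTagParser` class are assumptions for illustration; the original markup and the EO.Pdf calls that attach bookmarks to positions in the PDF are not reproduced here.

```python
from html.parser import HTMLParser


class BookmarkTagParser(HTMLParser):
    """Collect elements carrying the 'bookmark' class into a tree."""

    def __init__(self):
        super().__init__()
        self.roots = []    # top-level bookmark nodes
        self._stack = []   # current chain of open ancestors

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if "bookmark" not in (attributes.get("class") or "").split():
            return
        level = int(attributes.get("data-level", "1"))
        node = {"title": attributes.get("data-title", ""),
                "level": level, "children": []}
        # Drop ancestors that are at the same or a deeper level.
        while self._stack and self._stack[-1]["level"] >= level:
            self._stack.pop()
        if self._stack:
            self._stack[-1]["children"].append(node)
        else:
            self.roots.append(node)
        self._stack.append(node)


html = """
<span class="bookmark" data-level="1" data-title="Category A" hidden></span>
<h1>Category A</h1>
<span class="bookmark" data-level="2" data-title="Overview" hidden></span>
<span class="bookmark" data-level="2" data-title="Pricing" hidden></span>
<span class="bookmark" data-level="1" data-title="Category B" hidden></span>
"""

parser = BookmarkTagParser()
parser.feed(html)
```

After parsing, `parser.roots` holds two top-level nodes, and the first one has "Overview" and "Pricing" as children.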
Handling Empty Pages
In our case, the content is from a CMS, and the user gets an option to select which categories/sub-categories and sections of data are displayed in the generated PDF. At times, some of the selected combinations might not have any data in the system. To avoid printing a blank page (or an error page) in the generated PDF, we can inspect the conversion result for the returned content. Whenever the content does not exist, the HTML endpoint returns an EmptyResult. On the PDF conversion side, you can check if the response is empty and accordingly perform your logic to ignore the generated PDF.
Once the individual PDF files are created for each section and category/subcategory combination, we can merge them together to generate the full PDF. We will see in the next post how to merge the bookmarks together along with shifting the PDF pages and generating the Table of Contents from the bookmarks.
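The idea of skipping empty responses can be sketched as a simple filter. This Python snippet is an illustration under assumptions: `fetch_html` is a hypothetical callable standing in for the HTTP request, and the URLs are made up. The original check was done against the conversion result in C#.

```python
def sections_with_content(urls, fetch_html):
    """Yield (url, html) only for endpoints that returned a non-empty body."""
    for url in urls:
        html = fetch_html(url)
        if html is None or not html.strip():
            continue  # nothing to render, so no PDF is generated for it
        yield url, html


# Fake responses standing in for the CMS endpoints.
responses = {
    "/pdf/category-a/section-1": "<h1>Section 1</h1>",
    "/pdf/category-a/section-2": "",      # endpoint returned an empty result
    "/pdf/category-b/section-1": "  \n",  # whitespace only
}

kept = list(sections_with_content(responses, responses.get))
```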
If you own a website, then it is good to check its performance now and then. Various factors affect the site speed, and it’s better to use some tools to do the job for you.
YSlow is a similar tool from Yahoo! that analyzes web pages for performance. It is available as an extension for all popular browsers. It provides a grade for the web page based on a predefined ruleset or a user-defined ruleset. The grade is calculated based on the YSlow Ruleset Matrix.
I have been implementing some of the recommendations from the above tools on this blog, but as you can see from the above results, there is still more to be done. Hope you find this helpful for your sites.
At one of my recent clients, we had a requirement to generate a PDF dynamically based on the contents of the website. The website is a Content Management System (CMS) built on top of Umbraco. The content is grouped into different categories and sub-categories. Each category and sub-category has different sections/sub-sections under it. Some sections are optional for certain categories, and all of these are dynamic. In this post, I will walk through at a high level the approach taken to solve the problem.
The user selects the categories/sub-categories and the sections that they wish to export as PDF. On submit, a PDF needs to be generated based on the website content.
The actual site had one more level of options (say sub-sections), so you can imagine the number of possible combinations to generate the content. The site content was huge as well, and a PDF with all options selected would be around 4000-5000 pages. So creating the PDF every time someone clicks the button was out of the question. We had to cache the generated PDFs and serve them as requests come in. But the challenge was how to manage the cache so that we can build up the PDF based on the options selected.
Below is the flow diagram of the complete process of generating the PDF as a request comes in. The request specifies the categories/sub-categories along with the sections that need to be in the generated PDF.
We decided to create a PDF file for each section per category/subcategory selection. Once all the sections are ready, all the PDF files are merged into one. While merging, we also build up the bookmark tree and the table of contents. Inserting the table of contents page at the start of the PDF requires shifting all the bookmark page numbers to match the new ones.
The PDF layout for individual sections per category/subcategory is in HTML. The application exposes endpoints for the HTML content for the different sections. We used the Essential Objects HTML to PDF Converter to convert the HTML to PDF files. Bookmarks for the associated section are embedded in the HTML. While converting to PDF, the bookmarks get added to the PDF, which later gets merged into the full bookmark tree. The generated PDF file is cached for any new requests.
Since we have around forty categories/sub-categories, twelve sections, and ten sub-sections, generating the full PDF takes a while. So we generate the cache at fixed intervals and as required (when content is updated in the CMS). The above approach of generating PDF files has been working fine for us. Since the individual PDF sections are generated in isolation, it gives us the flexibility to scale the generation process as required. Combining the generated PDF files is usually fast and can be cached at a different level as well to speed up the whole process.
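The per-section caching described above can be sketched as a cache keyed by the (category, section) pair. This Python snippet is only an illustration: `SectionPdfCache`, `collect_pdfs`, and the in-memory dictionary are assumptions, and the real system may well store the files elsewhere.

```python
class SectionPdfCache:
    """Cache of generated section PDFs keyed by (category, section)."""

    def __init__(self, generate):
        self._generate = generate  # callable(category, section) -> bytes
        self._store = {}

    def get(self, category, section):
        key = (category, section)
        if key not in self._store:
            self._store[key] = self._generate(category, section)
        return self._store[key]

    def invalidate(self, category=None):
        """Drop one category's entries, or everything when category is None."""
        if category is None:
            self._store.clear()
            return
        for key in [k for k in self._store if k[0] == category]:
            del self._store[key]


def collect_pdfs(cache, selection):
    """selection maps category -> list of sections, in output order."""
    return [cache.get(category, section)
            for category, sections in selection.items()
            for section in sections]


calls = []

def fake_generate(category, section):
    # Stand-in for the HTML to PDF conversion of one section.
    calls.append((category, section))
    return f"{category}/{section}".encode()


cache = SectionPdfCache(fake_generate)
first = collect_pdfs(cache, {"A": ["Overview", "Pricing"], "B": ["Overview"]})
second = collect_pdfs(cache, {"A": ["Pricing"]})  # served from the cache
cache.invalidate("A")                             # e.g. content changed in the CMS
third = collect_pdfs(cache, {"A": ["Pricing"]})   # regenerated
```

Because each section is cached independently, any selection of options can be assembled from existing pieces, and a content update only needs to regenerate the affected entries.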