Lets Learn

Opinion Matters

Basics Of SpreadSheet Part-2

Posted by Ankush on March 8, 2010


This is a continuation series to discuss about SpreadsheetML. Today I am going to discuss about how worksheet stores values, formulas and different technique it uses.

Worksheetx.xml
==================

The worksheet (basically Worksheetx.xml file) is the main component in SpreadsheetML. We already know that SheetData is the section which contain the actual values. To create new values for the worksheet you define rows inside the sheetData element. These rows contain cells ( c element), which contain values (v element). For ex:

<worksheet xmlns=”http://…/spreadsheetml/2006/main&#8221; > <sheetData> <row> <c> <v>42</v> </c> </row> </sheetData> </worksheet>

Here  is great optimization. You see only those rows/cells in which value is present. This can be very handy while reading the values. Basically if you use UI, you need to make sure to read only those cells which contains value. Using Open XML can be very easy in those scenarios where you want to read specific section on workbook.

 There are different techniques to store string value/Formulas in spreadsheet. Let’s have a look at the different techniques in detail:

In-Line String:
===================

This element allows for strings to be expressed directly in the cell definition instead of implementing the shared string table (explained in the same blog).To create a cell with text you use the ‘is’ element, which stands for inline-string.

<c r=”A1″><is><t>String</t> </is></c> 

Sparse table markup
=======================

As you might have noticed that up until now the row and cell position starts from the left top of the spreadsheet and runs outward, the first cell being cell A1. Using this model it takes quite a lot of markup if you only want to put a value in cell. E5. E is the fifth column, 5 the fifth row, that’s 25 cells to just get to cell E5. That is obviously not the most optimal thing to do with regard to the amount of markup and parsing to do. To solve this problem, a row and cell can include a position identifier. To for instance create a spreadsheet with a single row positioned at index 5, you add the r attribute to the row element as seen in markup sample 76. Cells use a similar model to identify their position. The r attribute is to identify the column and row index. Obviously the row index should be equal to the one defined at the row level.

<worksheet xmlns=”http://schemas.openxmlformats.org/spreadsheetml/2006/main”&gt;
<sheetData><row r=”5″><c r=”E5″><v>1234</v></c></row></sheetData></worksheet>

 Also note that you cannot specify row 5 before row 4 and column E before D. Doing so is considered an error in the spreadsheet.

Shared String
==================

If you create a spreadsheet with Microsoft Excel you will notice that cell text differs from the inline strings example you have just seen. Shared strings optimize space requirements when the spreadsheet contains multiple instances of the same string. If you stored these repeated strings using inline string markup, the same markup would appear over and over in the worksheet. There are several downsides to this approach. First of all the file will take more space on disk because of all the redundancy of content. Moreover, loading and saving will also take longer.To optimize the use of strings in a spreadsheet SpreadsheetML takes the approach of storing a single instance of the string in a table, called the shared strings table. The cells then reference the string by index instead of storing the value inline in the cell value.

The shared strings table appears as a separate part inside the package. Each workbook contains only one shared strings part whether the strings appear multiple times in one sheet or in multiple sheets. As I mention that there can only be one shared string table in the spreadsheet. Since each worksheet uses the same, it is referenced from the workbook instead of from each individual worksheet.  The shared string table is defined using the sst element. It uses two attributes to keep track of the content and usage of shared strings. The si element is used to add an item to the shared string. It stands for string-item and is the exact opposite of the inline string element is.

<sst xmlns=”http://schemas.openxmlformats.org/spreadsheetml/2006/main”&gt;
<si><t>Region</t></si><si><t>Sales Person</t></si><si><t>Sales Quota</t></si></sst>

The cells used the type quantifier t to identify them as inline string containers. Now they need to use shared strings instead. The type needs to be set to ‘shared string’, identified using the s value. The value of the cell is used to store the index into the shared string table. A good example is the header row of the sample spreadsheet. It uses the following markup to reference the string table.

<worksheet xmlns=”http://schemas.openxmlformats.org/spreadsheetml/2006/main”><sheetData&gt; <row r=”1″ spans=”1:4″><c r=”A1″ t=”s”><v>0</v></c><c r=”B1″ t=”s”><v>1</v></c><c r=”C1″ t=”s”><v>2</v></c><c r=”D1″ t=”s”><v>3</v></c></row></sheetData></worksheet>

 Formulas
===================

Formulas are stored inside each cell that uses a formula. Using the f element you define the formula text. Formulas can contain mathematical expressions that include a wide range of predefined functions.

<c><f>(A3*B3) * (1-C3)</f> <v>3168</v> </c>

 The v element, which previously contained the direct value, is now used to store the cached formula value based on the last time the formula was calculated. This allows the consumer to postpone calculation of the formula values when the spreadsheet is opened, which would result in longer waits when opening a worksheet. You do not need to specify the value, and if you omit it, it is the responsibility of the Open XML reader to compute the value based on the formula definition. The use of functions in a highly internationalized environment such as the document workspace brings interesting questions. When you run Microsoft Office Excel you are allowed to express functions in the native language. The ‘SUM()’ function in Spanish would become ‘SUMA()’. The important bit to consider is the way in which this is stored in Open XML. The function will be saved in its non-localized form. When a Spanish user would use the SUMA() function from within the consumer, the underlying markup would store the SUM() function. Only the user interface is localized, not the markup. This saves you a great deal of program code when working with functions, and let the developers benefit of working with well-known functions. Formulas cannot only appear inside a worksheet cell. Other elements such as table columns are also capable of storing formulas. The same model is used through-out. The formula is always stored as text.

Shared formulas
=============================

Similar to shared strings there is also an optimization for formulas. Like string values, formulas are often repeated over a range of cells, for instance in a calculated column of a table. They only differ in small details such as the row index the formula uses. For the consumer it is easier to load the first formula into memory and base the repeating formulas on that data than it is to load each formula individually. Unlike the optimization to strings using a separate shared-string part, the shared formula model does not rely on a separate part inside the package. It all takes place right within the worksheet data. A normal formula was defined by just adding the formula element f inside the cell. The formula text was stored within the opening and closing tags of the f element. For shared formulas the formula element has an extra attribute applied to identify it as being a shared formula, similar to identifying the cell type. The first cell using the formula stores three values. It stores the formula text like normal and a reference to the cell range using that formula type. The formula itself is identified using the si attribute. The next cell using that formula uses the same value for this attribute. It does not define the formula text, only the first cell has to store that information.

<row> <c> <v>1</v> </c> <c> <f t=”shared” ref=”C2:C4″ si=”0″>A2 + 1</f> <v>2</v> </c> </row> <row> <c> <v>2</v> </c> <c> <f t=”shared” si=”0″ /> <v>3</v> </c> </row>

Similarly you can create different elements in Excel using UI and then analyze the XML to see how it is defined and learn more….

Now we understand the basics of SpreadSheet, lets discuss about some basics operations in Excel using Open XML:

  • How to format cell as datetime
  • How to format number as number
  • How to add background color
  • How to add a formula
  • How to Clear a column

In my upcomming blogs, I will discuss about these tasks. If you have anything, which you would like me to have a createa blog upon, just let me know.

Here is previous part:  Basics of Spreadsheet – Part 1

Advertisements

3 Responses to “Basics Of SpreadSheet Part-2”

  1. Yvaine said

    Hi Ankush, thanks for your handy online tutorial. I am new to apache POI. Using the sample XFFS + SAP sample given in the apache’s website, I managed to read cells with number and text format. However, if the cell is in date time format, the output would be number. Do you have any idea how to solve this?

  2. Ankush said

    Hi Yvaine,

    Actually its not that easy. Open XML stores date as number and apply a styleID to get the format when Excel displays it.

    A full proof solution would be to convert the number into a date format. You will find a lots of example doing this on the web. Have a look at :
    http://blogs.msdn.com/b/brian_jones/archive/2008/11/10/reading-data-from-spreadsheetml.aspx

    Specifically check this portion

    //If cell value is date then convert value to a human readable date

    if (definedName.Key.Contains(“Date”))

    {

    DateTime d = new DateTime(1900, 1, 1);

    d = d.AddDays(Int32.Parse(cellValue) – 1);

    cellValue = d.ToShortDateString();

    }

    I hope this helps!

    Thanks,
    Ankush

  3. rahima said

    plz can u tel how to refresh cached formula value ??????

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

 
%d bloggers like this: