Archive for the ‘Databases’ Category

In the very first steps of database warehousing, the insertion of data stored in a text file to a database can be a real obstacle that holds the user back from getting to the real part of data processing.

There lot’s of ways to insert data from a certain file (depending of course on the type of the file some steps are different) inside a certain database. In this post i will try to describe two of which i found really useful.

SQL Server Import and Export Wizard

One way to get through this (and probably the easier one) is the sql server import and export wizard. This is a very useful tool for inserting data from various types of files, into a database. It simply creates a new table in the database and stores all the data there.

A simple example to fully understand how it works, is to have a text file in which all data is separated in rows and all columns are delimited with a certain special character (eg. tab, space etc. The wizard tool identifies most of them by itself).

To begin with the insertion, you right click on the database’s name and select Tasks -> Import Data. The following window will open and you will have to select “Flat File Source” as your data source and browse for your text file with all the data. If the column names are given in the first row of the data file, then you check the option “column names in the first data row”.

After you have finished click next. In the window that will appear you select “SQL Native Client” as your Destination and as the server name you give the name of the server to be used. Also you choose the name of the database (if it isn’t selected already) on the lower part of the window.

Now on to the next window. In the next part of the wizard you just check if you have selected the right file to draw your data from and the right table to put them. If not, you can change the source file or the name and column types of your table.

On to the next one. In this one you only need to check the “Execute immediately”.

 

The final window before execution is a summary of everything you have chosen so far, so as to make sure everything is alright.

 

And be clicking Finish, the insertion begins, and if everything works out you should have a window like this one.

 

 

 

 

Bulk Insert

The other way I have found easy to use is the “bulk insert” command. Let’s say you had a data file named income.data found in c:/.

To begin with, you have to create a database with a certain name. Afterwards you create a table inside that database and define the columns and the corresponding column types. For example, here is a list I used for an exercise.

 

Column Name Data Type
age int
[class of worker] nvarchar(60)
[detailed industry recode] int
[detailed occupation recode] int
education nvarchar(100)
[wage per hour] float
[enroll in edu inst last wk] nvarchar(60)
[marital state] nvarchar(60)
[major industry code] nvarchar(100)
[major occupation code] nvarchar(100)
race nvarchar(60)
[hispanic origin] nvarchar(60)
Sex nvarchar(10)
[member of a labor union] nvarchar(20)
[reason for unemployment] nvarchar(60)
[full or part time employment stat] nvarchar(100)
[capital gains] float
[capital losses] float
[dividends from stocks] float
[tax filer stat] nvarchar(60)
[region of previous residence] nvarchar(20)
[state of previous residence] nvarchar(60)
[detailed household and family stat] nvarchar(100)
[detailed household summary in household] nvarchar(100)
[instance weight] float
[migration code-change in msa] nvarchar(30)
[migration code-change in reg] nvarchar(60)

So after you have created your table, you can use the following line of code and insert your data. In this example the fields are devited by a “,” the rows are devited by a “.”, while there are 199523 lines in the file.

 

bulk insert dbo.Income_prime

from ‘c:/income.data’

with (FIRSTROW =1, FIELDTERMINATOR = ‘,’, ROWTERMINATOR = ‘.’, LASTROW= 199523)

GO

Advertisements

nvarchar/nchar to decimal problem

Posted: December 29, 2010 in Databases

Most frequently there seems to be a problem in databases, while converting data from one type to another. One of these cases takes place while converting data from nvarchar/nchar to decimal, as the values are delimited by a comma and not a dot, eg 5,20.

The true problem is that the server doesn’t recognize this as a formal form for a decimal number. One way to solve this, would be to go back to the original data source file, use the find and replace tool, convert all the commas to dots, and then re – insert all the values back in the table.

There is, although, an easier way to convert all of the commas, while still in the database and in a string form.

If for example, we had the following column of a table

Quantity

1253,172

1253,173

1253,174

1253,175

1253,176

1253,179

1253,180

1253,181

1253,182

1253,183

By using the following sql query on the given table, it transforms to the values shown below.

 

Update table_name set Quantity =  replace(Quantity, ‘,’  , ‘.’)

Quantity
1253.172
1253.173
1253.174
1253.175
1253.176
1253.179
1253.180
1253.181
1253.182
1253.183

 

With this query we are actually updating the table setting a column to new values. The Replace() function takes 3 parameters.

  1. The name of the column at the given table, or the string in which we want to change a sequence of characters
  2. The sequence of characters to be changed
  3. The sequence of characters to be changed with

The “Halloween Problem”

Posted: December 26, 2010 in Databases

In computing, the Halloween Problem refers to a phenomenon in databases in which an update operation causes a change in the physical location of a row, potentially allowing the row to be visited more than once during the operation. This could even cause an infinite loop in some cases where updates continually place the updated record ahead of the scan performing the update operation.

The potential for this database error was first discovered by Don Chamberlin, Pat Selinger, and Morton Astrahan in 1976, on Halloween day while working on a query that was supposed to give a ten percent raise to every employee who earned less than $25,000. This query would run successfully, with no errors, but when finished all the employees in the database earned at least $25,000, because it kept giving them a raise until they reached that level. The expectation was that the query would iterate over each of the employee records with a salary less than $25,000 precisely once. In fact, because even updated records were visible to the query execution engine and so continued to match the query’s criteria, salary records were matching multiple times and each time being given a 10% raise until they were all greater than $25,000.

The name is not descriptive of the nature of the problem but rather was given due to the day it was discovered. As recounted by Don Chamberlin,

Pat and Morton discovered this problem on Halloween …. I remember they came into my office and said, ‘Chamberlin, look at this. We have to make sure that when the optimizer is making a plan for processing an update, it doesn’t use an index that is based on the field that is being updated. How are we going to do that?’ It happened to be on a Friday, and we said, ‘Listen, we are not going to be able to solve this problem this afternoon. Let’s just give it a name. We’ll call it the Halloween Problem and we’ll work on it next week.’ And it turns out it has been called that ever since.

article by wikipedia

B+ trees

Posted: December 23, 2010 in Databases

A B+ tree or else or B plus tree are multi-way trees, meaning that each node contains a set of keys and pointers.  The true value of B+ trees is obvious during data storing for efficient retrieval in a block-oriented storage context, such as filesystems, because of their very high fanout, which reduces the number of I/O operations required to find an element in the tree.

The B+ trees represent sorted data in a way that allows for efficient insertion, retrieval and removal of records, each of which is identified by a key. They are dynamic and their hight grows and contracts as records are added or deleted. The contents and the number of index pages reflects this growth and shrinkage.

Simply put a B+ tree of order n (n>3) is a n-ary tree with the following properties:

  • All records are stored at the leaves of the tree
  • The root is either a leaf or has between two and n children
  • An internal node (non-leaf) stores up to n-1 keys to guide the searching; key i represents the smallest key in subtree i+ 1
  • Leaves are always on the same level

Adding Records to a B+ Tree

To insert records in a B+ tree one may follow the steps given in this site.

  • do a search to determine what bucket the new record should go in
  • if the bucket is not full, add the record.
  • otherwise, split the bucket.
  • allocate new leaf and move half the bucket’s elements to the new bucket
  • insert the new leaf’s smallest key and address into the parent.
  • if the parent is full, split it also
  • now add the middle key to the parent node
  • repeat until a parent is found that need not split
  • if the root splits, create a new root which has one key and two pointers.

It also provides with a java online tool to build a b+ tree. For example, one may have the following records in the corresponding order:

18, 96, 41, 54, 112, 28, 50, 90, 23, 43, 14, 36, 131

This would generate the following results